Tuesday, May 22, 2018

[google-cloud-sql-discuss] Re: Postgres Connections Randomly Dropping

For anyone who is maybe experiencing a similar situation, here's what the problem was!

Dinesh pointed out that the OperationalError and connection issues that we were experiencing coincided with kubernetes node pool autoscaling events. I realized that when the GKE cluster autoscaled, pods are sometimes rescheduled on different nodes and some of the cloudsql proxy pods were terminated (specifically, we were running a DaemonSet and when the node pool downsized, cloudsql proxy pods would terminate). However, we have very long running tasks that don't complete before the termination grace period of the cloudsql proxy pods ends. So the pod was being removed and the connection to the database between corresponding long-running tasks and the cloudsql proxy pods was breaking prematurely. Issue was resolved by extending the termination grace period and adding a preStop command to delay the tear down of the cloudsql pods so that all the connections can drain before the pods are removed.

Nigel G.

On Friday, 11 May 2018 15:40:00 UTC-7, Dinesh (Google Platform Support) wrote:
Hi Nigel, 

Yes I Received your private message and reverted you on that thread. 

Regards,

On Friday, May 11, 2018 at 1:59:17 PM UTC-4, Nigel Gutzmann wrote:
Hi Dinesh, did you get my private reply? Just want to make sure that I sent it correctly, not sure if I'm using this group exactly correctly.

Nigel G

On Tuesday, 8 May 2018 13:31:31 UTC-7, Nigel Gutzmann wrote:
I sent a private reply, thanks Dinesh

On Thursday, 26 April 2018 13:49:39 UTC-7, Dinesh (Google Platform Support) wrote:
Hi Nigel,

Thanks for the feedback. For Postgres SQL instance, fail-over instance is located in the different zone with-in the same configured region, so if the master cannot serve data from its primary zone, it fails over and continues to serve data from its secondary zone. As per the cloud documentation, if cloud SQL instance configured for HA experiences an outage or becomes unresponsive, Cloud SQL automatically switches to serving data from the secondary zone. It is recommended to configure all of your instances that contain production data for HA.

Regarding the connection problem, please check and verify the following points:-

1. Default TCP/IP port on which the postgres server listen is tcp:5432. Any specific reason why you using port 3306 which is default port for MySQL DB? Please correct this if it is a mistake.

2. Please note that there is Maximum concurrent connections limit for Cloud SQL based upon instance memory. Please check if your instance is not going beyond the connection limit during peak hours.

3. Please refer to this StackOverflow thread for the similar problem and suggested solutions for the troubleshooting. 

If above does not help, please share your project number, PostgreSQL instance name, Instance operations logs, and PostgreSQL error logs along with issue date, time and duration via a private email.

Regards,
 
On Wednesday, April 25, 2018 at 5:38:43 PM UTC-4, Nigel Gutzmann wrote:
Thanks for the suggestion Dinesh, but I don't think that's it. We are getting the errors much more frequently than there are events in the operations tab. Typically we get 2-3 bursts of errors per day, but there is generally just one operation: a backup. The timing also doesn't necessarily coincide between the events and the errors. Regardless, if the operations were a problem, would turning on HA fix the issue, or would that not help?

Any other ideas about why we could be seeing these connection problems?

Nigel G.

On Wednesday, 25 April 2018 14:17:18 UTC-7, Dinesh (Google Platform Support) wrote:
As you suggesting you receive such errors only for 2-3 minutes in a day, I suspect your instance might be going through maintenance updates (that require an instance restart) during that times.  Please view operational logs of your Postgre SQL instance. You can view them from cloud console GUI inside instance details view, under operations tab. If you find instance was updated at the same time, that explain the cause of these logs. 

If that is the real cause of the mentioned errors, I will recommend configuring the schedule for the Maintenance window and Maintenance timing to avoid any surprises in the future. 

Let me know if this helps?

Regards,


On Wednesday, April 25, 2018 at 12:59:51 PM UTC-4, Nigel Gutzmann wrote:
I have a django and celery application running inside of Google Kubernetes Engine. I am connecting to my CloudSQL instance (postgres) using a Kubernetes service running the CloudSQL Proxy. Database connections and queries generally work fine, but occasionally we get spurts of errors with connections breaking. They are raised in python like this:

OperationalError: could not connect to server: Connection refused Is the server running on host "cloudsql-proxy-service" and accepting TCP/IP connections on port 3306?

or

OperationalError: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request.

I can't find anything that might cause that in the logs of the CloudSQL instance. There are some messages like this in the CloudSQL proxy logs:

2018/04/24 18:55:18 Instance <project_name>:us-central1:<instance_name> closed connection

But I can't necessarily correlate the timestamps between when those messages appear and when we get the python errors. I have tried setting CONN_MAX_AGE and tcp keepalives like this inside django's settings.py:

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql',
        'NAME': '<db_name>',
        'USER': os.environ.get('DB_USER', None),
        'HOST': os.environ.get('DB_HOST', None),
        'PORT': os.environ.get('DB_PORT', None),
        'PASSWORD': os.environ.get('DB_PASSWORD', None),
        'CONN_MAX_AGE': int(os.environ.get('CONN_MAX_AGE', 0)),
        'OPTIONS': {
            'keepalives': 1,
            'keepalives_idle': 480,
            'keepalives_interval': 10,
            'keepalives_count': 3,
        },
    },
}

But that didn't seem to make a difference. We still get the same errors in bunches, about 20 errors over the span of 2-3 minutes, 2-3 times per day.

--
You received this message because you are subscribed to the Google Groups "Google Cloud SQL discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to google-cloud-sql-discuss+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/google-cloud-sql-discuss/af3faf5f-36d3-4f4f-853f-650ad662b146%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

No comments:

Post a Comment