Thursday, June 28, 2012

Re: Troubleshooting "too many concurrent requests"

Hi Ken,

Thanks. The 200+ clients rarely trigger a DB call now, since I'm using
the datastore to cache data and have cron jobs that periodically flush
data to/from GC-SQL and the datastore. The DB hangs under a load of
less than 1 QPS (according to the GAE dashboard, though I know those
numbers don't map one-to-one onto GC-SQL queries) and stays hung even
when only a single client tries to reach the DB over a ten-minute
stretch. In other words, once the connections are maxed out, none seem
to time out quickly enough for the system to heal itself.
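
For context, the flush cron works roughly like this -- a simplified
sketch, where the model, table, instance, and database names are
placeholders rather than the real ones:

    from google.appengine.ext import db
    from google.storage.speckle.python.api import rdbms

    class CachedResult(db.Model):
        # Hypothetical stand-in for my real cached entities.
        payload = db.TextProperty()
        synced = db.BooleanProperty(default=False)

    def flush_to_sql():
        conn = rdbms.connect(instance='myproject:myinstance', database='mydb')
        try:
            cursor = conn.cursor()
            pending = CachedResult.all().filter('synced =', False).fetch(100)
            for entity in pending:
                # Parameter style assumed; adjust to the driver's paramstyle.
                cursor.execute('INSERT INTO results (payload) VALUES (%s)',
                               (entity.payload,))
                entity.synced = True
            conn.commit()
            db.put(pending)  # mark entities synced only after the SQL commit
        finally:
            conn.close()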

The application is a research project that involves worker nodes and
the GAE/GC-SQL app as the control master that allocates work and
aggregates results. I can give you more specific detail off-channel,
if that might help. It's a university/Google collaboration, actually.

In general, things work for long stretches now, except that every few
hours I come back to find the DB hung with maxed-out connections
("Instance has too many concurrent requests: 101"), which I can only
resolve by restarting my instance manually via the APIs console. I'm
unable to see what the maxed-out connections are doing, as neither the
GC-SQL APIs console nor the command-line tool can even connect to the
DB to poke around while the connections are maxed out. Most likely,
either some deadlock scenario creeps in occasionally, or a random load
spike sets off a chain of events (GAE deadlines --> DB connections
hanging --> GAE using up all 100 connections --> splat).
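
The one guard I can apply on my end is making sure every handler
releases its connection in a finally block, so a blown GAE deadline at
least doesn't leave a connection lingering toward the 100 cap. A
minimal sketch, assuming the GAE rdbms driver, with placeholder
instance/database names:

    from google.storage.speckle.python.api import rdbms

    def run_query(sql, args=()):
        conn = rdbms.connect(instance='myproject:myinstance', database='mydb')
        try:
            cursor = conn.cursor()
            cursor.execute(sql, args)
            return cursor.fetchall()
        finally:
            # Runs even when the GAE deadline fires mid-query, so the
            # server-side connection isn't left counting toward the cap.
            conn.close()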

I think I'd like to take your advice and write another cron job that
kills GC-SQL processes stuck waiting too long when the DB appears
down; a sketch of what I mean is below. But this job would itself stop
working once all 100 connections are established: after all, it needs
a connection of its own to view the processlist and issue the kill,
the same catch that makes the APIs console mostly useless in this
scenario. I'd have to run it very frequently for it to have a chance
of being useful.
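
Something along these lines -- a rough sketch, again assuming the
rdbms driver, with placeholder instance/database names and a guessed
threshold:

    from google.storage.speckle.python.api import rdbms

    MAX_WAIT_SECONDS = 30  # guessed threshold; would need tuning

    def kill_stuck_threads():
        conn = rdbms.connect(instance='myproject:myinstance', database='mydb')
        try:
            cursor = conn.cursor()
            cursor.execute('SELECT CONNECTION_ID()')
            own_id = cursor.fetchone()[0]
            cursor.execute('SHOW FULL PROCESSLIST')
            for row in cursor.fetchall():
                thread_id, user, host, dbname, command, time_s = row[:6]
                # Kill only long-running queries; skip idle ('Sleep')
                # threads and this watchdog's own connection.
                if thread_id != own_id and command == 'Query' \
                        and time_s > MAX_WAIT_SECONDS:
                    cursor.execute('KILL %d' % thread_id)
        finally:
            conn.close()

Anyway, that brings me to another related question: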

Is there any programmatic introspection/management available for
GC-SQL instances? E.g., is there a better way to detect and/or
troubleshoot the kind of scenario I'm running into? It seems that if
there's an APIs console for it, the team could expose a meta-API for
controlling the instance. For example, GAE calls that report the state
of the DB, check the number of connections (maybe even the min/max/avg
connection time per process), detect when the DB fails to load (which
is what the APIs console typically reports under the above scenario),
and let me restart the instance programmatically would all be very
useful.
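
Until something like that exists, the best workaround I can see is a
frequent cron health check that attempts a trivial query, so the
wedged state is at least detected quickly. A sketch (instance/database
names are placeholders; the restart itself would still be manual):

    import logging
    from google.storage.speckle.python.api import rdbms

    def db_health_check():
        # Cron handler body: detect a wedged DB quickly and loudly.
        try:
            conn = rdbms.connect(instance='myproject:myinstance',
                                 database='mydb')
            try:
                cursor = conn.cursor()
                cursor.execute('SELECT 1')
                cursor.fetchone()
            finally:
                conn.close()
        except Exception as e:
            # Likely "Instance has too many concurrent requests: 101";
            # all this can do today is log/alert -- the restart still
            # happens by hand in the APIs console.
            logging.error('Cloud SQL health check failed: %s', e)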

Relatedly, is there any logging visible for GC-SQL itself? There's a
"Logs" tab in the APIs console, but it only shows the entry from when
I first created the instance. It could stand to be a bit more
verbose. :-)

Thanks for your time and help!
Jake

On Mon, Jun 25, 2012 at 7:42 PM, Ken Ashcraft <kash@google.com> wrote:
> On Sat, Jun 23, 2012 at 5:46 AM, Jake Czyz <jjc.work@gmail.com> wrote:
>>
>> Hi Ken,
>>
>> I'm the one that asked both of those questions. :-)
>>
>> I still hit the limits, and my application becomes unusable after a
>> while, with everything waiting on the hung SQL processes that seem to
>> be using up the 100-connection limit.  It's only after I restart
>> my instance in the APIs console that things start working again.  Is
>> there a way to programmatically restart the instance?
>
>
> No, sorry.
>
>>
>> Is there a way to
>> reduce the timeout on the connections so that it's the same/less than
>> the GAE timeout?
>
>
> Not really.  You can reduce the deadlock timeout within mysql.
> http://dev.mysql.com/doc/refman/5.5/en/innodb-parameters.html#sysvar_innodb_lock_wait_timeout
>
> You could also set up a cron job to kill long-running queries.
>
>>
>> Other ideas?  I can't be the only one with 200+ clients that
>> occasionally trigger GC-SQL queries.  I have a hard time believing a
>> Google cloud product isn't designed to scale to my minuscule load.
>>
>
> 200+ simultaneous clients sounds like thousands of QPS.  I wouldn't call
> that minuscule.  Cloud SQL doesn't do anything magical to make your database
> scale.  It is a mysql server running on fault-tolerant storage.  No
> autosharding or anything like that.  Keep in mind that because we do
> synchronous replication, write latencies are higher than what you'd get with
> a local mysql.  That might cause you to hold locks longer than you'd expect.
>
> We can increase the limit, but every time we've seen somebody run into this
> situation, increasing the limit would just make things worse.  More
> simultaneous queries would mean more contention and more timeouts.
>
> What is your application?
>
> Ken
>
>>
>>
>> On Fri, Jun 22, 2012 at 4:12 PM, Ken Ashcraft <kash@google.com> wrote:
>> > These might help:
>> >
>> > http://stackoverflow.com/questions/10424442/what-are-the-connection-limits-for-google-cloud-sql-from-app-engine-and-how-to
>> >
>> > http://stackoverflow.com/questions/10158805/whats-a-good-approach-to-managing-the-db-connection-in-a-google-cloud-sql-gae/10438622#10438622
>> >
>> >
>> > On Fri, Jun 22, 2012 at 2:48 AM, Jake Czyz <jjc.work@gmail.com> wrote:
>> >>
>> >> Hello,
>> >>
>> >> I've been working on a GAE / Cloud SQL app for many weeks and keep
>> >> hitting the wall on various GC-SQL limits that have forced me to
>> >> refactor my app several times.  A persistent one I'm running into is
>> >> the error:
>> >>
>> >> google.storage.speckle.python.api.rdbms.0: Instance has too many
>> >> concurrent requests: 101
>> >>
>> >> Does anyone have any advice for troubleshooting this error or
>> >> otherwise coding database connections in a way that
>> >> minimizes/mitigates this limitation?
>> >>
>> My app has a couple hundred clients, and I've already had to refactor
>> it substantially to use the datastore instead of GC-SQL for most
>> interaction with clients (data is then synced to/from GC-SQL via cron
>> jobs... which are also occasionally exceeding the 60-second deadline,
>> but that's another issue...).  Right now, there's only a single type
>> of request that my clients still perform that needs to access the
>> GC-SQL DB directly (as opposed to the datastore), due to consistency
>> requirements.  This request happens roughly once every 30 seconds on
>> average, though sometimes there's a bunch in a row (i.e. one every
>> couple of seconds for a minute or two).
>> >>
>> I can't even view processes (e.g. via "show processlist") or kill
>> threads via the Google APIs console SQL prompt, as it displays "Failed
>> to load databases."  I know that restarting my instance (and waiting
>> several minutes) usually gets me back on track, but I've gotten
>> frustrated and am wondering what other strategies or ideas folks have
>> for dealing with the database connection limits.
>> >>
>> >> Thanks in advance,
>> >> Jake
>> >
>> >
>
>
