Tuesday, September 15, 2020

Re: [google-cloud-sql-discuss] Re: Scaling postgres updates on app engine

Thank you, David!

I wound up doing a fun benchmark to test different methods for creating the matrix, and indeed using COPY FROM directly is very speedy - the original task that took ~30 hours went down to 20 minutes. What still took a long time was generating the input file for that task - almost 6 hours - and that's an underestimate, since I just grabbed a simulated value; the real calculation would need time to actually run.
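
For the record, here's roughly what the bulk load looks like with psycopg2's copy_from; the connection string and the similarity table/columns are illustrative, not our real schema:

import io

import psycopg2

# Illustrative connection; not our real credentials or schema.
conn = psycopg2.connect("dbname=appdb user=appuser")

# Building this input is the slow part (~6 hours in my test),
# since every score has to be computed before it can be written.
buf = io.StringIO()
for row_id, col_id, score in [("a", "b", 0.91), ("a", "c", 0.42)]:
    buf.write(f"{row_id}\t{col_id}\t{score}\n")
buf.seek(0)

with conn, conn.cursor() as cur:
    # COPY FROM STDIN is the fast path: one streamed statement
    # instead of millions of individual INSERTs.
    cur.copy_from(buf, "similarity", columns=("row_id", "col_id", "score"))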

Cloud Functions are serverless and meant for short-lived tasks, so as long as these tasks stay short-lived (less than 10 minutes), Cloud Functions with Cloud Tasks could work instead of using GKE. You could also take a look at Cloud Run.
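
For reference, enqueueing work for an HTTP-triggered Cloud Function with the google-cloud-tasks client might look something like this (the project, location, queue, and function URL are all placeholders):

import json

from google.cloud import tasks_v2

client = tasks_v2.CloudTasksClient()

# Placeholder project/location/queue names.
parent = client.queue_path("my-project", "us-east1", "matrix-updates")

task = {
    "http_request": {
        "http_method": tasks_v2.HttpMethod.POST,
        "url": "https://us-east1-my-project.cloudfunctions.net/update_cell",
        "headers": {"Content-Type": "application/json"},
        # One small unit of work per task, e.g. one cell (or batch) to score.
        "body": json.dumps({"row_id": "a", "col_id": "b"}).encode(),
    }
}

response = client.create_task(request={"parent": parent, "task": task})
print(f"Created task {response.name}")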

I know you are asking about Cloud SQL scalability, but have you considered Bigtable (if your data is set to grow beyond 100 GB)?

That's a good suggestion! What about this for an idea? If we have a scheduled job that kicks off scaled / parallel tasks (possibly with Cloud Functions), each ultimately writing one value to a row in Bigtable, we could then export to text for a (reasonably fast) ~20 minute import. It seems a bit redundant because we'd just be writing to a different database, but we wouldn't be stressing the application to do the calculations, or the database with many small calls. The goal would be to have the entire thing done in an hour or so, run once a week.
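
If we went that route, writing one value per row with the google-cloud-bigtable client might look roughly like this (instance, table, and column family names are made up, and I haven't tried this yet):

from google.cloud import bigtable

client = bigtable.Client(project="my-project")
instance = client.instance("matrix-instance")
table = instance.table("similarity")

# One matrix cell per Bigtable row; the row key encodes the pair.
row = table.direct_row(b"a#b")
row.set_cell("scores", b"score", b"0.91")
row.commit()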
 
Best,

Vanessa

On Wednesday, September 9, 2020 at 4:40:30 PM UTC-4 vso...@gmail.com wrote:

Hey folks!

I'm working on an App Engine standard environment Django application (Python 3.7) where we have a few similarity matrices that will warrant complete updates regularly (weekly or monthly). For example, one is about 12K by 12K, and the other is 16K by 16K. Because these are similarity matrices - the data being compared isn't changing, and each cell in the matrix (a score and other metadata) is its own model instance - the operations can be treated as many small, independent tasks. What I'm wondering about are different strategies for scaling the operation, primarily to make it faster or more efficient. Running one round of updates (basically iterating through the diagonal of the matrix) in serial takes about a day.
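
To sketch what I mean by many small tasks: the unique cell pairs can be chunked into batches, and each batch becomes one unit of work (illustrative only; items stands in for whatever is being compared):

from itertools import combinations, islice

def batched_pairs(items, batch_size=500):
    # Yield batches of unique (i, j) pairs from the upper triangle,
    # so each batch can become one queued task.
    pairs = combinations(items, 2)
    while True:
        batch = list(islice(pairs, batch_size))
        if not batch:
            break
        yield batch

# For ~12K items that's ~72M unique pairs, so batch_size is the knob
# that trades task count against per-task runtime.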

I'm going to cross-post this to both the App Engine and Cloud SQL groups, so apologies for the double post! I've only started exploring ideas: I've been looking at the task queue and Cloud Functions, and I'm thinking of some strategy that submits a bunch of jobs to a queue to be processed, with some maximum number of connections allowed to run updates at once. Batch seems like a lot of overhead (and expense) just to update the matrices for a tiny application, but I haven't tried it. We would also need to think about access permissions for this task to do the update (directly or indirectly). I'm not sure whether this kind of operation would require horizontal scaling. I want to find a solution that isn't hugely complex, so it's easy to reproduce in the future. Thank you!
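
For the "maximum number of connections" part, I gather a Cloud Tasks queue can cap concurrency at the queue level; something like this might work, though I haven't tested it and all the names are placeholders:

from google.cloud import tasks_v2

client = tasks_v2.CloudTasksClient()

queue = tasks_v2.Queue(
    name=client.queue_path("my-project", "us-east1", "matrix-updates"),
    rate_limits=tasks_v2.RateLimits(
        # Caps how many tasks run at once, which indirectly caps
        # simultaneous database connections from the workers.
        max_concurrent_dispatches=10,
        max_dispatches_per_second=20,
    ),
)
client.update_queue(request={"queue": queue})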

Best,

Vanessa
