Wednesday, October 24, 2018

[google-cloud-sql-discuss] ETL approaches to bulk load data in Cloud SQL

I'm not really sure where I should ask these types of questions, so this is a cross-post of a question I asked on StackOverflow: https://stackoverflow.com/questions/52973508/etl-approaches-to-bulk-load-data-in-cloud-sql


Since this is a discussion forum, maybe someone could help me.


I need to ETL data into my Cloud SQL instance. The data comes from API calls. Currently, I'm running custom Java ETL code in Kubernetes, with CronJobs that make requests to collect this data and load it into Cloud SQL. The problem lies in managing the ETL code and monitoring the ETL jobs, and the current solution may not scale well as more ETL processes are incorporated. In this context, I need to use an ETL tool.
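To make the current setup concrete, here is a minimal sketch of the collection half of one of these jobs. The endpoint and the CSV-ish response format are made up for illustration, not my real API:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;

public class ApiCollector {
    /** Pulls one page of results from the (hypothetical) API and returns
     *  the parsed rows, ready to be loaded into Cloud SQL. */
    public static List<String[]> fetchRows() throws Exception {
        HttpClient http = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://api.example.com/metrics"))  // made-up endpoint
                .build();
        String body = http.send(request, HttpResponse.BodyHandlers.ofString()).body();

        List<String[]> rows = new ArrayList<>();
        for (String line : body.split("\n")) {  // pretend the API returns CSV lines
            rows.add(line.split(","));
        }
        return rows;
    }
}
```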


My Cloud SQL instance contains two types of tables: common transactional tables and tables that contain data coming from the API. The second type is mostly read-only from an "operational database" perspective, and a large share of those tables is bulk-updated every hour (in batch) to discard the old data and refresh the values.
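The hourly refresh is basically "delete the stale batch, insert the fresh one" inside a single transaction, so readers never see a half-refreshed table. A sketch of that pattern, assuming a hypothetical api_metrics table and the row format from the collector above:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;
import java.util.List;

public class HourlyRefresh {
    static void refresh(String jdbcUrl, String user, String pass,
                        List<String[]> freshRows) throws Exception {
        try (Connection conn = DriverManager.getConnection(jdbcUrl, user, pass)) {
            conn.setAutoCommit(false);  // one transaction for the whole swap
            try (Statement st = conn.createStatement()) {
                st.executeUpdate("DELETE FROM api_metrics");  // discard the old data
            }
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO api_metrics (id, value) VALUES (?, ?)")) {
                for (String[] row : freshRows) {
                    ps.setLong(1, Long.parseLong(row[0]));
                    ps.setString(2, row[1]);
                    ps.addBatch();      // batched inserts keep the bulk load fast
                }
                ps.executeBatch();
            }
            conn.commit();  // readers see either the old batch or the new one
        }
    }
}
```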


Considering this context, I noticed that Cloud Dataflow is the ETL tool provided by GCP. However, it seems more suitable for big data applications that need to do complex transformations and to ingest data in multiple formats. Also, in Dataflow, data is processed in parallel and worker nodes are scaled as needed. Since Dataflow is a distributed system, the ETL process would probably carry some overhead in allocating resources just to do a simple bulk load. On top of that, I noticed that Dataflow doesn't have a dedicated sink for Cloud SQL. This probably means that Dataflow isn't the right tool for simple bulk-load operations into a Cloud SQL database.
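For completeness: even without a dedicated Cloud SQL sink, Beam's generic JdbcIO transform can write to a Cloud SQL instance over JDBC. A minimal sketch, where the bucket, table, and connection details are all placeholders:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class BulkLoadPipeline {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        p.apply(TextIO.read().from("gs://my-bucket/api-dump/*.csv"))  // placeholder input
         .apply(JdbcIO.<String>write()
             .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                     "com.mysql.jdbc.Driver",
                     "jdbc:mysql://<cloud-sql-ip>:3306/mydb")  // placeholder instance
                 .withUsername("etl")
                 .withPassword("secret"))
             .withStatement("INSERT INTO api_metrics (id, value) VALUES (?, ?)")
             .withPreparedStatementSetter((line, stmt) -> {
                 String[] cols = line.split(",");
                 stmt.setLong(1, Long.parseLong(cols[0]));
                 stmt.setString(2, cols[1]);
             }));
        p.run();
    }
}
```

So a bulk load from Dataflow is at least possible; my worry is whether it's worth the distributed-system overhead for a job this simple.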


For my current needs, I only have to do simple transformations and bulk-load the data. In the future, however, we might want to handle other sources of data (PNG, JSON, and CSV files) and other sinks (Cloud Storage and maybe BigQuery). Also, we might eventually want to ingest streaming data and store it in Cloud SQL. In this sense, the underlying Apache Beam model is really interesting, since it offers a unified model for batch and streaming.
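What attracts me to Beam is that the transformation code stays the same whether the source is bounded (batch) or unbounded (streaming); only the input changes. A sketch of that idea, with placeholder bucket and topic names:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class UnifiedPipeline {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        boolean streaming = false;  // flip to true to consume the same data live
        PCollection<String> input = streaming
            ? p.apply(PubsubIO.readStrings()
                  .fromTopic("projects/my-project/topics/api-events"))       // placeholder
            : p.apply(TextIO.read().from("gs://my-bucket/api-dump/*.csv"));  // placeholder

        // The transformation (and eventually the Cloud SQL write) is identical
        // in both modes; only the source differs.
        input.apply(MapElements.into(TypeDescriptors.strings())
                               .via((String line) -> line.trim()));
        p.run();
    }
}
```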

Given all this context, I can see two approaches:


1) Use an ETL tool like Talend in the cloud, to help with monitoring and maintaining the ETL jobs.

2) Use Cloud Dataflow, since we may need streaming capabilities and integration with all kinds of sources and sinks.


The problem with the first approach is that I may end up needing Cloud Dataflow anyway when future requirements arrive, and that would be bad for my project in terms of infrastructure costs, since I would be paying for two tools.


The problem with the second approach is that Dataflow doesn't seem suitable for simple bulk-load operations into a Cloud SQL database.


Is there something I am getting wrong here? Can someone enlighten me?

