Moving Analytics Data Between Clouds
Written by Mira Daniels (Data Engineer @SumUp)
At SumUp we use Google Analytics and Firebase as analytics tools (not switched to GA4 yet), the raw data is exported to BigQuery, but we do not use BigQuery as our Data Lake or DWH solution. So we need to ingest data from one cloud into another and have to decide which tool (3rd party or not and if not which cloud’s tools) to use. Many Analysts depend on the GA data for their analyses, reporting, testing and marketing automation efforts. As always the best solution depends on the exact use case, available skills, time and financial resources.
As shown above what we needed was a data pipeline from BigQuery to Data Lake (AWS S3) and DWH (in our case Snowflake). We want to make the data available not just for analytical usage in the DWH but also for every other Data consumer that wants to access it from the Data Lake. As you can imagine, Google doesn’t make it very easy for you to move data out of their cloud. We also had a time constraint, 1 month, to have a process in place due to our contract renewal cadence with a 3rd party tool provider.
Our solution so far: individual teams, if equipped with Data Engineering capabilities, build and maintain Airflow pipelines moving the needed subsets of data from BigQuery to the DWH (no data being sent to the Data Lake). In addition, we introduced Fivetran using their Google Analytics 360 connector to move GA data from BigQuery to their Snowflake Destination. The connector takes the data from BigQuery which is great as Analytics are already familiar with the schema and do not need to use e.g. the schema that the Analytics Reporting API provides.
The following factors should be considered when working with Fivetran:
- Unlike BigQuery, Fivetran unnests the data and splits it into different tables (Sessions scope, Hit scope and Product scope), which means that the queries you used in BigQuery cannot be used on the exported data.
- It only supports intraday tables but not realtime tables, so depending on when your data is being imported to BigQuery and you depend on real time data you are forced to wait for the full-day table to arrive to move the data with Fivetran.
- The destination in Fivetran is Snowflake but we also have data consumers that want to access it via our Data Lake, e.g. for Data Science, and Fivetran does not support a S3 destination yet.
- Fivetran is not free. Moving all of our web analytics data gets expensive since you pay per MAR, and the records in BigQuery are all being unnested by Fivetran.
Unfortunately Fivetran does not offer a Firebase connector yet but this is planned for Q3 2022. The alternative would be to connect to Google Cloud Storage after having moved the Firebase data from BigQuery there (e.g. via Cloud function).
Looking into alternatives to Fivetran we considered the following factors:
- Cost
- Size of data chunks loaded at once
- Dependencies on APIs and GCP
- Completeness of data in AWS S3 (raw data in JSON or parquet)
- Schedule options for data sync
- PII scanner for raw data
- Observability of the process
- Encryption
- Backfilling of historical data
- Handling of secrets and service account credentials
We tested different alternatives, some no-code ones like Appflow and Airbyte (once set up) and Airflow, which offers a Google Cloud Storage to S3 operator. Appflow had the obvious limitation of using the GA API, which meant that the data was not raw (sampling, not all dimensions being available) and each API call could only pull 9 dimensions. Airbyte proved to be unreliable to us and was inflexible in its sync schedule and dataset input values for BigQuery.
We also did simple tests with the Google Cloud Shell:
gsutil rsync -r gs://your-gs-bucket/your-extract-path/your-schema s3://your-aws-bucket/your-target-path/your-schemaThis approach works if you want to move files once but didn’t seem to be a suitable process for us since you cannot customise the command as needed in order to control the factors listed above.
After making progress with Airflow and AWS Glue we decided to focus on Glue and to not explore other options due to our time limit. The flexibility that Glue offers in modifying the connector and script could not be matched using the out of the box Google Cloud operators Airflow provided. Glue is a tool enjoying all benefits of the AWS ecosystem, which we also chose for our Data Lake as mentioned earlier (i.e.secrets do not need to be shared with other tools/clouds). Also, pipeline jobs can be scripted flexibly using PySpark, and we could automatically store the metadata (Data Catalog) and define partitioning of the data in S3. Glue can be used no-code as well, but since we wanted our Firebase data to be overwritten for the last 72h (this is how far back the data can change in BigQuery due to e.g. “offline” data generated with lost internet access) we had to use a custom script.
The script below is not the full script but it shows the main components.
def for_date(year, month, day):
    GoogleBigQueryConnector0220forAWSGlue30_node1 = (
        glueContext.create_dynamic_frame.from_options(
            connection_type="marketplace.spark",
            connection_options={
                "parentProject": "your-bq-project",
                "table": "your-dataset.events_" + year + month + day,
                "connectionName": "connection-name",
            },
            transformation_ctx="GoogleBigQueryConnector0220forAWSGlue30_node1",
        )
    )
   
    dyf = GoogleBigQueryConnector0220forAWSGlue30_node1
    df = dyf.toDF()
    if spark.catalog._jcatalog.tableExists("your-aws-dataset.table-name") == False:
         
       
            df = df.withColumn("year", f.lit(year)).withColumn("month", f.lit(month)).withColumn("day", f.lit(day))
            df.write.partitionBy("year", "month", "day").mode("overwrite").saveAsTable("your-aws-dataset.table-name", path="s3://your-s3-path/")
   
    else:
       
        df = df.withColumn("year", f.lit(year)).withColumn("month", f.lit(month)).withColumn("day", f.lit(day))
        df.createOrReplaceTempView('delta')
   
        spark.sql(f'insert overwrite table fb_events.firebase PARTITION(year,month,day) select * from delta')
        job.commit()The next step for us is to move the data from our Data Lake to Snowflake in order to make it accessible for all analytical use cases. This will be done using External Tables that reference S3. The scheduling of the tasks (Glue job creation, update, run and Snowflake refresh) will still be done with Airflow using boto3 with the python operator and snowflake operator. Depending on your Airflow version you can also use the glue job operator.
I would love to discuss the problems we faced and other solutions to it with you at the MeasureCamp Berlin next month!
 
        