December 14, 2018 • Mahmoud Taha

Cloud Functions + BigQuery = Data Feed Automation

Reading Time: 5 minutes

The efficiencies of automation and hosted solutions are compelling for many enterprises.

Google Cloud Functions constitute an event-driven serverless compute platform that gives you the ability to run your code in the cloud without worrying about the underlying infrastructure.

Cloud Functions provide the following fundamental advantages:

Running enterprise code for computing or operational tasks usually means hosting it on a virtual machine. Cloud Functions spare you the effort and complexity of maintaining your own virtual machines.
Hosting your own infrastructure can be costly relative to the benefit, especially if you only need to run the code a few times a day. Cloud Functions, on the other hand, are ephemeral: they spin up and back down on demand, so you only pay for the time it takes your code to run.
Related to the previous point, you can configure Cloud Functions to fire in response to events in the environment, reducing or eliminating the need for manual activation.
My favorite part is that you can now write your code in Python 3 (beta) as well as JavaScript (Node.js).

The Use Case

In our use case we’re going to use the event-based trigger in Cloud Functions to:

fire up a function once the GA 360 BigQuery export creates the ga_sessions_YYYYMMDD table for the day before
then use this table to generate some custom reports
then export these reports to CSV files in a Cloud Storage bucket

Why not Dataprep?

So, why not just use Dataprep to run BigQuery views and schedule a Dataflow job to fire at 12:05 am after the GA table is created, then save the results to Cloud Storage with no coding (except the SQL queries, of course)? The issue is that if the GA BigQuery export is delayed for any reason, the views will not find the table and the job will fail. Using a Stackdriver trigger is a more failsafe approach. The trigger only fires once the table is created, which removes the timing dependency and ensures that the Cloud Function finds the table when it runs the queries.

But can Cloud Functions be triggered by BigQuery?

Actually no, Cloud Functions can’t be triggered by BigQuery directly. Still, there is a way to work around this using Stackdriver and Pub/Sub (which is one of the trigger sources for Cloud Functions). In our use case we’ll be using more than one product from the GCP family: Stackdriver, Pub/Sub, Cloud Functions (Python), Cloud Storage and of course BigQuery (not necessarily in that order).

It looks like a lot of work!

Well, it may look like there are a lot of products used here, but almost all the work is done in Cloud Functions and BigQuery. The rest of the products just help to close the loop with minimal configuration.

Putting it all to work

So, let’s list all the steps needed to get the job done, then get into details one by one:

Create a new Pub/Sub topic to set as the sink for the Stackdriver export.
Set up a Stackdriver Logging export, in which we define a filter to monitor BigQuery logs and fire when the GA table is created.
Write the BigQuery queries we need to extract the required reports.
Create a new Cloud Function and choose the trigger to be the Pub/Sub topic we created in the first step.
Write Python code for the Cloud Function to run these queries and save the results into pandas dataframes.
Finally, write the dataframes into CSV files in Cloud Storage.

We’ll not go deep into the functionality of each GCP product. We’ll just review what we need to configure to have our use case working.

The help docs provide additional discussion on the Google Cloud Platform components used in this solution.

Pub/Sub & Stackdriver

First we create a new Pub/Sub topic (which can be done implicitly while creating the Stackdriver export). Then we make use of the fact that BigQuery logs are active by default: we define an export in Stackdriver with a filter that matches only the log entry written when the GA table is created in BigQuery, and we configure the export to send the body of this log entry (which is in JSON format) to our Pub/Sub sink.
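The same export can also be created from code instead of the console. The following is only a sketch under stated assumptions, not the configuration used in this post: the filter fields assume the GA 360 export appears in the BigQuery audit logs as a completed load job, and the sink, project and topic names passed to these functions are hypothetical placeholders. Check the field paths against a real log entry in your project before relying on them.

```python
def build_sink_filter(table_prefix="ga_sessions_"):
    """Build a Stackdriver Logging filter for completed BigQuery load jobs.

    Assumption: the GA 360 export is logged as a load job whose destination
    table ID starts with `table_prefix`.
    """
    job = "protoPayload.serviceData.jobCompletedEvent.job"
    table_id = job + ".jobConfiguration.load.destinationTable.tableId"
    return "\n".join([
        'resource.type="bigquery_resource"',
        'protoPayload.methodName="jobservice.jobcompleted"',
        '{}:"{}"'.format(table_id, table_prefix),
        # Skip the intraday tables, which share the same prefix.
        'NOT {}:"{}intraday"'.format(table_id, table_prefix),
    ])


def pubsub_destination(project_id, topic_id):
    """Return the sink destination string for a Pub/Sub topic."""
    return "pubsub.googleapis.com/projects/{}/topics/{}".format(
        project_id, topic_id)


def create_sink(sink_name, project_id, topic_id):
    """Create the export; needs google-cloud-logging and credentials."""
    from google.cloud import logging  # imported lazily: optional dependency
    client = logging.Client(project=project_id)
    sink = client.sink(
        sink_name,
        filter_=build_sink_filter(),
        destination=pubsub_destination(project_id, topic_id))
    sink.create()
    return sink
```

Whichever way the sink is created, its writer identity needs permission to publish to the topic, otherwise no messages will arrive.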

Preparing the BigQuery queries

In this step we prepare the BigQuery queries that will produce the needed reports. Without going into too much detail on how to write BigQuery queries, we’ll use the query below, which retrieves all sessions from the day before that included an Add to Cart ecommerce action, along with the details of the products involved.

    SELECT
      CONCAT(fullVisitorId, '.', CAST(visitId AS string)) AS sessionId,
      hit.page.pageTitle AS pageTitle,
      CONCAT(hit.page.hostname, hit.page.pagePath) AS pageURL,
      hit.page.hostname AS hostname,
      product.productSKU AS productSKU,
      product.v2ProductName AS productName,
      product.v2ProductCategory AS productCategory,
      product.productPrice/1000000 AS productPrice,
      product.productQuantity AS productQuantity
    FROM
      `..ga_sessions_*`,
      UNNEST(hits) AS hit,
      UNNEST(hit.product) AS product
    WHERE
      hit.eCommerceAction.action_type = '3'
      AND _TABLE_SUFFIX = FORMAT_DATETIME("%Y%m%d",
        DATETIME_ADD(CURRENT_DATETIME(), INTERVAL -1 DAY))

Query retrieving product data for previous day’s sessions that included Add to Cart ecommerce action in Google Analytics.
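To make the _TABLE_SUFFIX condition concrete, here is a small Python sketch of the same date arithmetic. The helper name yesterday_suffix is hypothetical and is not part of the Cloud Function in this post. Note that CURRENT_DATETIME() in BigQuery defaults to UTC, while date.today() uses the local time zone of the machine running the code, so the two can disagree around midnight.

```python
from datetime import date, timedelta


def yesterday_suffix(today=None):
    """Return yesterday's date as YYYYMMDD, the ga_sessions_ table suffix."""
    if today is None:
        today = date.today()
    return (today - timedelta(days=1)).strftime("%Y%m%d")


print(yesterday_suffix(date(2018, 12, 14)))  # 20181213
```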

Creating the Cloud Functions

Now, on to creating our Cloud Function. We need to define the function name, the memory to allocate, the trigger (Pub/Sub in our case) and the topic (our topic name). For the runtime we’ll use Python 3 (which is currently in beta). Node.js is another option, but we’ll stick to Python for now.
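A Pub/Sub-triggered Python function is called with two arguments, the message and its context, and the message payload arrives base64-encoded under the "data" key. The sketch below shows one way to read the exported log entry inside the function. The entry point name on_table_created is hypothetical, and the path to the table ID assumes the audit-log layout of a completed load job, so adjust it to match the entries you actually receive.

```python
import base64
import json


def decode_log_entry(event):
    """Decode the Stackdriver log entry carried by a Pub/Sub message."""
    return json.loads(base64.b64decode(event["data"]).decode("utf-8"))


def destination_table_id(log_entry):
    """Return the load job's destination table ID, or None if absent.

    Assumption: the entry follows the BigQuery audit-log layout for a
    completed load job.
    """
    path = ["protoPayload", "serviceData", "jobCompletedEvent", "job",
            "jobConfiguration", "load", "destinationTable", "tableId"]
    node = log_entry
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def on_table_created(event, context):
    """Hypothetical entry point: report which table triggered the function."""
    table_id = destination_table_id(decode_log_entry(event))
    print("Triggered by table: {}".format(table_id))
    return table_id
```

Reading the table ID from the message is optional for the report in this post, since the query always targets yesterday's table, but it is handy for logging and for ignoring unexpected messages.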

Python & Cloud Storage

Below is a snippet of the Cloud Function Python code that runs the BigQuery query and exports the results to a CSV file in a Cloud Storage bucket.

    from google.cloud import bigquery
    from google.cloud import storage

    def export_to_gcs(event, context):
        # Pub/Sub-triggered functions are called with the message and its
        # context; neither is needed here, since the query always targets
        # yesterday's table.
        # BQ query to get add-to-cart sessions
        QUERY = """
            SELECT
              CONCAT(fullVisitorId, '.', CAST(visitId AS string)) AS sessionId,
              hit.page.pageTitle AS pageTitle,
              CONCAT(hit.page.hostname, hit.page.pagePath) AS pageURL,
              hit.page.hostname AS hostname,
              product.productSKU AS productSKU,
              product.v2ProductName AS productName,
              product.v2ProductCategory AS productCategory,
              product.productPrice/1000000 AS productPrice,
              product.productQuantity AS productQuantity
            FROM
              `..ga_sessions_*`,
              UNNEST(hits) AS hit,
              UNNEST(hit.product) AS product
            WHERE
              hit.eCommerceAction.action_type = '3'
              AND _TABLE_SUFFIX = FORMAT_DATETIME('%Y%m%d',
                DATETIME_ADD(CURRENT_DATETIME(), INTERVAL -1 DAY))
        """
        bq_client = bigquery.Client()
        query_job = bq_client.query(QUERY)  # API request
        rows_df = query_job.result().to_dataframe()  # Waits for query to finish
        storage_client = storage.Client()
        bucket = storage_client.get_bucket('BucketName')
        blob = bucket.blob('Add_to_Cart.csv')
        blob.upload_from_string(
            rows_df.to_csv(sep=';', index=False, encoding='utf-8'),
            content_type='application/octet-stream')

Cloud Function python code, executed when the function is triggered

Here, we are using google.cloud.bigquery and google.cloud.storage packages to:

connect to BigQuery to run the query
save the results into a pandas dataframe
connect to Cloud Storage to save the dataframe to a CSV file.

The final step is to set our Python function export_to_gcs() as “Function to execute” when the Cloud Function is triggered.

How much data can this handle?

Say our data is in the volume of millions of records; we can always extend the memory allocated for our Cloud Function up to 2GB, but this comes with a higher price, and what if even 2GB is not enough?

Another workaround for this is not using Pandas to save query results. Instead, we can save the results to a BigQuery intermediate table, then export this table directly to Cloud Storage, letting BigQuery do all the heavy lifting for us.

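Below are two functions to do so. The save_to_bq_table() function runs a query and saves the results to a BigQuery table. Here we set allow_large_results = True so that the job does not fail on a huge result set; the flag matters for legacy SQL queries, whereas standard SQL queries that write to a destination table accept large results regardless.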

    def save_to_bq_table():
        # dataset_id is assumed to be defined elsewhere, e.g. at module level
        bq_client = bigquery.Client()
        # Save the data to an intermediate table, then export it to GCS
        query = "##Query with millions of records results##"
        job_config = bigquery.QueryJobConfig()
        # Set the destination table
        table_ref = bq_client.dataset(dataset_id).table('TableID')
        job_config.destination = table_ref
        job_config.allow_large_results = True
        # Start the query, passing in the extra configuration.
        query_job = bq_client.query(
            query,
            location='US',  # Location must match that of the source table
            job_config=job_config)  # API request - starts the query
        query_job.result()  # Waits for the query to finish

Function to save query results to a BigQuery intermediate table.

The export_bq_table() function exports the table to one or more CSV files in Cloud Storage, then deletes the table.

    def export_bq_table():
        # dataset_id, project_id and tableId are assumed to be defined
        # elsewhere; tableId must name the table written by save_to_bq_table()
        client = bigquery.Client()
        destination_uri = 'gs://{}/{}'.format('BucketName', 'ExportFileName_*.csv')
        dataset_ref = client.dataset(dataset_id, project=project_id)
        table_ref = dataset_ref.table(tableId)
        extract_job = client.extract_table(
            table_ref,
            destination_uri,
            location='US')  # API request; location must match the source table
        extract_job.result()  # Waits for job to complete.
        client.delete_table(table_ref)  # API request

Function to export the BigQuery intermediate table to Cloud Storage and delete the table.

Each exported file is limited to 1GB, so we add an asterisk * to the file name in the URI. BigQuery then splits the export into multiple files with incrementing names: FileName-000000000000.csv, FileName-000000000001.csv, FileName-000000000002.csv and so on.
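As an illustration of the naming scheme, the sketch below expands a wildcard URI the way BigQuery does, replacing the asterisk with a 12-digit, zero-padded number that starts at 0. The helper shard_uris and its shard_count argument are hypothetical: in a real export BigQuery decides how many files to write.

```python
def shard_uris(destination_uri, shard_count):
    """List the object URIs a wildcard export would produce.

    The single `*` is replaced with a 12-digit, zero-padded shard number
    starting at 0. `shard_count` is an illustrative input only.
    """
    if destination_uri.count("*") != 1:
        raise ValueError("expected exactly one '*' in the destination URI")
    return [destination_uri.replace("*", "{:012d}".format(i))
            for i in range(shard_count)]


for uri in shard_uris("gs://BucketName/ExportFileName_*.csv", 3):
    print(uri)
```

A helper like this can be useful downstream, for example to check that the expected files exist in the bucket before handing them to another system.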

Summary

Although Cloud Functions are not meant for complex transformations, which are a task for Dataflow, they are a very powerful tool that can be used alongside other GCP products to automate quick tasks with little coding effort. They can be used for exporting data from BigQuery, loading data from Cloud Storage into BigQuery once files land in a bucket, reacting to a specific HTTP request, monitoring Pub/Sub topics to parse and process different messages, and much more.

Other use cases

Some other use cases of Google Cloud Functions include:

Author

Mahmoud Taha
Mahmoud is a Data Engineering Manager with over 15 years of experience in data engineering, data warehousing, and business intelligence. He specializes in designing and building scalable data pipelines using Python, SQL, and cloud technologies to support data-driven decision-making. Mahmoud has led cross-functional teams across digital marketing and telecommunications, translating complex business needs into robust technical solutions.