Building a Data Pipeline in GCP for BigQuery Data Models

In today’s data-driven world, ensuring that data is ingested, transformed, and delivered efficiently is critical for business operations. A well-designed CI/CD (Continuous Integration/Continuous Deployment) pipeline automates data transformations, ensuring consistency, accuracy, and scalability. 

In this post, we’ll walk through how to build a data pipeline in Google Cloud Platform (GCP) using BigQuery, Pub/Sub, Dataform, Workflows, and GitHub.

Why We Built This Pipeline

Many organizations rely on data flowing into BigQuery from multiple sources, either directly or via Cloud Storage as a staging area. However, managing transformations manually can lead to inefficiencies, delays, and errors. Our goal was to automate data transformations in BigQuery using a structured workflow, ensuring that:

  • Data transformations are triggered automatically upon new data ingestion.
  • The pipeline runs efficiently without manual intervention.
  • Business users always have access to the latest processed data via visualization tools like Looker or Tableau.

Technical Solution: How It Works

Data Ingestion and Logging

Data enters BigQuery through various sources:

  • Cloud Storage as a staging area, where data is temporarily stored before being loaded into BigQuery (a minimal load-job sketch follows this list).
  • Direct ingestion from external platforms.
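
For the staging path, moving files from Cloud Storage into a raw table is a standard BigQuery load job. Below is a minimal sketch using the google-cloud-bigquery client; the bucket, dataset, and table names are placeholders, and the format/schema settings will depend on your source files.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical staging bucket and raw table; adjust to your own layout.
uri = "gs://my-staging-bucket/orders/*.json"
table_id = "my-project.raw.orders"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # infer the schema from the staged files
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

# Kick off the load job and block until it finishes.
load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()
print(f"Loaded {client.get_table(table_id).num_rows} rows into {table_id}")
```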

The Cloud Logging Log Router detects new data ingestion events (for example, completed BigQuery load jobs) and routes a message to a Pub/Sub topic through a log sink.
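
One way to wire this up is a Log Router sink whose filter matches completed BigQuery load jobs and whose destination is the Pub/Sub topic. The sketch below uses the google-cloud-logging client; note that the filter shown targets the legacy BigQuery audit-log format, so verify it against the log entries your project actually emits, and remember to grant the sink's writer identity the Pub/Sub Publisher role on the topic.

```python
from google.cloud import logging

client = logging.Client()

# Filter for completed BigQuery load jobs (legacy AuditData format);
# confirm against your project's audit logs before relying on it.
log_filter = (
    'resource.type="bigquery_resource" '
    'AND protoPayload.methodName="jobservice.jobcompleted" '
    'AND protoPayload.serviceData.jobCompletedEvent.eventName="load_job_completed"'
)

# Placeholder destination: the Pub/Sub topic that triggers the pipeline.
destination = "pubsub.googleapis.com/projects/my-project/topics/bq-ingestion-events"

sink = client.sink("bq-load-events-to-pubsub", filter_=log_filter, destination=destination)
sink.create()
```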

Triggering the Workflow

Once the Pub/Sub topic receives a message, it triggers a Cloud Workflows execution, which acts as the central orchestrator for our pipeline.
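
The hand-off from Pub/Sub to Workflows is usually configured declaratively, for example with an Eventarc trigger on the topic, but the effect is equivalent to the call below. This is a minimal sketch using the google-cloud-workflows executions client; the project, region, workflow name, and argument payload are all placeholders.

```python
import json
from google.cloud.workflows import executions_v1

client = executions_v1.ExecutionsClient()

# Placeholder identifiers; replace with your project, region, and workflow.
parent = "projects/my-project/locations/us-central1/workflows/bq-transform-pipeline"

# Pass the relevant details from the Pub/Sub message as the workflow
# argument, so the workflow knows which dataset/table just landed.
argument = json.dumps({"dataset": "raw", "table": "orders"})

response = client.create_execution(
    request={"parent": parent, "execution": {"argument": argument}}
)
print(f"Started execution: {response.name}")
```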

What is Dataform?

Dataform is Google Cloud's service for managing SQL-based data transformations in BigQuery. It enables teams to organize, version-control, and automate data pipelines through modular SQL workflows that are easier to maintain, review, and scale. Unlike traditional approaches that rely on scheduled queries or custom Cloud Functions, Dataform provides a centralized, Git-integrated workspace where dependencies are clearly defined and execution is streamlined.

By adopting Dataform, you can simplify orchestration logic and gain visibility into your data transformation processes.

Executing Transformations with Dataform

The workflow invokes Dataform against the GitHub repository's main branch, which is linked to a Dataform workspace (a minimal API sketch follows the list below). Dataform then:

  • Runs a set of dependent BigQuery scripts to clean, transform, and load data into a final BigQuery model.
  • Keeps transformations version-controlled and executes them in the dependency order defined in the repository.
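
Programmatically, "running main" amounts to two Dataform API calls: compile the repository's main branch, then invoke the resulting compiled graph. Here is a minimal sketch using the google-cloud-dataform Python client (v1beta1 at the time of writing); the repository path is a placeholder.

```python
from google.cloud import dataform_v1beta1

client = dataform_v1beta1.DataformClient()

# Placeholder repository path; replace with your Dataform repository.
repo = "projects/my-project/locations/us-central1/repositories/analytics"

# 1. Compile the main branch of the linked Git repository.
compilation = client.create_compilation_result(
    request={
        "parent": repo,
        "compilation_result": {"git_commitish": "main"},
    }
)

# 2. Invoke the compiled graph; Dataform runs each action in BigQuery
#    in dependency order.
invocation = client.create_workflow_invocation(
    request={
        "parent": repo,
        "workflow_invocation": {"compilation_result": compilation.name},
    }
)
print(f"Started Dataform invocation: {invocation.name}")
```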

Notifications & Monitoring

Upon completion of the data transformation process, the workflow sends a notification to a Slack or Microsoft Teams channel, informing stakeholders that the job ran successfully or alerting them to any errors or warnings raised during pipeline execution.
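
For Slack, the notification step can be a simple POST to an incoming webhook (Cloud Workflows can issue this directly with its built-in http.post call). The Python equivalent below is a minimal sketch; the webhook URL is a placeholder you would create in your own Slack workspace.

```python
import requests

# Placeholder incoming-webhook URL; generate one for your Slack channel.
WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

def notify(status: str, detail: str) -> None:
    """Post a pipeline status message to the channel behind the webhook."""
    message = {"text": f"BigQuery pipeline {status}: {detail}"}
    response = requests.post(WEBHOOK_URL, json=message, timeout=10)
    response.raise_for_status()

notify("succeeded", "Dataform invocation completed with no errors.")
```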

Business Intelligence & Visualization

Finally, business users can analyze the transformed data using dashboards in Looker or Tableau, ensuring real-time access to insights.

Benefits of Building the Pipeline in GCP

  • Scalability: GCP services like BigQuery and Pub/Sub are designed to handle large-scale data with ease.
  • Serverless architecture: Reduces infrastructure management overhead.
  • Tight integration: Seamless connectivity between services like Workflows, Dataform, and Cloud Logging.
  • Cost efficiency: Pay-as-you-go model enables cost-effective operations.
  • Security and compliance: Built-in features for IAM, auditing, and compliance.
  • Continuous insights: Business users always have access to the latest processed data.

Example Use Case: Retail Seller

A large online retailer collects data from various sources, such as website activity logs, payment processors, inventory management systems, and marketing platforms. These sources push raw data into Cloud Storage or directly into BigQuery.

With this automated data pipeline:

  • New orders and customer interactions are ingested periodically.
  • The Cloud Log Router detects new data loads and triggers the transformation pipeline.
  • Dataform transforms raw sales, product, and customer data into well-modeled BigQuery tables.
  • Business intelligence dashboards in Looker or Tableau update automatically, giving sales teams, inventory managers, and marketing leads immediate access to insights.

This enables the retailer to make rapid, data-driven decisions, such as targeting high-value customers or identifying underperforming campaigns, without relying on manual data processing or delayed reporting.

Pros:

  • Operational visibility: Stay on top of trends and customer behavior.
  • Targeted marketing: Use customer and sales data to fine-tune campaigns.
  • Accurate reporting: Eliminate discrepancies caused by manually stitched data or stale extracts.
  • Simplified maintenance: Replace scattered custom scripts and isolated scheduled queries with a scalable, version-controlled transformation pipeline.

By leveraging these cloud products, your organization can streamline data transformation processes, reduce manual effort, and improve data reliability.

Would you like to explore any specific part of the pipeline in more detail? Connect with us to learn more and explore how we can help you clean and analyze your data.

Author

Mahmoud Taha

Mahmoud is a Data Engineering Manager with over 15 years of experience in data engineering, data warehousing, and business intelligence. He specializes in designing and building scalable data pipelines using Python, SQL, and cloud technologies to support data-driven decision-making. Mahmoud has led cross-functional teams across digital marketing and telecommunications, translating complex business needs into robust technical solutions.
