Coursera

Open source Data Engineering with Spark, dbt & Airflow Professional Certificate

Build Production Data Pipelines at Scale.

Explore Spark, dbt, and Airflow to design, automate, and deploy enterprise-grade data pipelines.

Included with Coursera Plus

Earn a career credential that demonstrates your expertise
Intermediate level

Recommended experience

4 weeks to complete
at 10 hours a week
Flexible schedule
Learn at your own pace

What you'll learn

  • Build modular, production-grade data pipelines using Apache Spark, dbt, and Airflow to ingest, transform, and load data at scale.

  • Design and implement dimensional data models including star schemas, SCD Type 2, and incremental load strategies for data warehouses.

  • Optimize distributed data processing by resolving Spark shuffle, skew, and partitioning issues to improve pipeline performance.

  • Automate deployments and enforce data quality using CI/CD pipelines, Docker containers, and automated testing frameworks like Great Expectations.
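One of the modeling patterns named above, SCD Type 2, can be sketched in a few lines of plain Python. This is a minimal illustration, not course material: the field names (`customer_id`, `city`, `start_date`, `end_date`, `is_current`) are invented for the example, and a real warehouse would express the same logic as a dbt snapshot or a MERGE statement.

```python
from datetime import date

# Minimal SCD Type 2 sketch: when a tracked attribute changes, close the
# current row (set end_date, is_current=False) and insert a new version,
# so the dimension preserves full change history for analytics.
# All field names here are illustrative.

def apply_scd2(dimension, updates, today):
    """Apply incoming updates to a list of dimension rows, SCD Type 2 style."""
    for upd in updates:
        current = next(
            (r for r in dimension
             if r["customer_id"] == upd["customer_id"] and r["is_current"]),
            None,
        )
        if current is None:
            # Brand-new key: insert as the first open-ended version.
            dimension.append({**upd, "start_date": today,
                              "end_date": None, "is_current": True})
        elif current["city"] != upd["city"]:
            # Attribute changed: expire the old row, then add a new version.
            current["end_date"] = today
            current["is_current"] = False
            dimension.append({**upd, "start_date": today,
                              "end_date": None, "is_current": True})
        # Unchanged rows are left alone, so history is never rewritten.
    return dimension

dim = [{"customer_id": 1, "city": "Lisbon",
        "start_date": date(2024, 1, 1), "end_date": None, "is_current": True}]
dim = apply_scd2(dim, [{"customer_id": 1, "city": "Porto"}], date(2025, 6, 1))
```

After the update the dimension holds two rows for customer 1: the expired Lisbon version and the current Porto version, which is exactly the history an analyst needs for point-in-time queries.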

Details to know

Shareable certificate

Add to your LinkedIn profile

Taught in English
Recently updated: March 2026

See how employees at top companies are mastering in-demand skills

(Logos of Petrobras, TATA, Danone, Capgemini, P&G and L'Oréal)

Advance your career with in-demand skills

  • Receive professional-level training from Coursera
  • Demonstrate your technical proficiency
  • Earn an employer-recognized certificate from Coursera

Professional Certificate - 6 course series

What you'll learn

  • Build end-to-end data pipelines that automatically ingest from databases, APIs, and streams using Spark, dbt, and Airflow.

  • Design data models with historical tracking using SCD Type 2 patterns to preserve complete change history for analytics.

  • Create automated workflows with intelligent retry logic, SLA monitoring, and parameterization for production reliability.

  • Optimize Spark job performance using partitioning and caching strategies to achieve 30%+ runtime improvements.
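The "intelligent retry logic" mentioned above is the kind of behavior an orchestrator such as Airflow applies per task (`retries`, `retry_delay`, exponential backoff). As a rough sketch of the idea in plain Python, with function names invented for the example rather than taken from Airflow's API:

```python
import time

# Retry-with-exponential-backoff sketch: rerun a failing task a bounded
# number of times, doubling the wait between attempts (1s, 2s, 4s, ...).
# `sleep` is injectable so the behavior can be observed without waiting.

def run_with_retries(task, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Run `task`, retrying on failure with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return task()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the failure to the scheduler
            sleep(base_delay * 2 ** attempt)

attempts = []

def flaky_extract():
    """Simulated extract step that fails twice before succeeding."""
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("source unavailable")
    return "rows loaded"

delays = []  # record the backoff schedule instead of actually sleeping
result = run_with_retries(flaky_extract, sleep=delays.append)
```

Here the extract fails twice, waits 1s then 2s, and succeeds on the third attempt; only if every retry fails does the error propagate to the scheduler.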

Skills you'll gain

  • Data Integration
  • Data Pipelines
  • Data Quality
  • Configuration Management
  • Data Warehousing
  • Apache Airflow
  • Data Architecture
  • Database Development
  • Enterprise Security
  • Data Flow Diagrams (DFDs)
  • Data Validation
  • Data Transformation
  • Apache Spark
  • Extract, Transform, Load
  • Data Processing
  • Data Modeling

Optimizing Spark and Cloud Data Storage for Analytics

Course 2, 10 hours

What you'll learn

  • Optimize Spark job performance through strategic partitioning and caching, achieving 30%+ runtime improvements using data access analysis.

  • Implement transactional data lakes with Delta format, enabling versioning, ACID operations, and schema evolution for reliable datasets.

  • Provision secure cloud data infrastructure using IAM policies, private networks, and encrypted storage following security best practices.

  • Evaluate and benchmark storage formats (Parquet, ORC, Avro) to select optimal solutions for analytical workloads and cost efficiency.
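Partitioning problems like the skew this course addresses come down to one hot key flooding a single shuffle partition. The effect can be demonstrated without a cluster; this pure-Python sketch (data and partition count are made up for the demonstration) mimics hash partitioning and measures how uneven the result is:

```python
from collections import Counter

# Skew sketch: rows are hash-partitioned by key, so a single hot key
# lands entirely in one partition. That partition's task dominates the
# stage runtime in Spark, which is why skew hurts so much.

def partition_sizes(keys, num_partitions):
    """Count how many rows each hash partition would receive."""
    sizes = Counter(hash(k) % num_partitions for k in keys)
    return [sizes.get(i, 0) for i in range(num_partitions)]

def skew_ratio(sizes):
    """Largest partition relative to the mean; 1.0 means perfectly even."""
    mean = sum(sizes) / len(sizes)
    return max(sizes) / mean if mean else 0.0

# 10,000 rows for one hot customer plus 1,000 distinct other keys:
keys = ["hot_key"] * 10_000 + [f"k{i}" for i in range(1_000)]
sizes = partition_sizes(keys, num_partitions=8)
ratio = skew_ratio(sizes)
```

With one key carrying ~90% of the rows, the hottest partition ends up several times larger than the average; remedies such as key salting or adaptive query execution work by breaking that hot key across partitions.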

Skills you'll gain

  • Data Integrity
  • Performance Tuning
  • Data Warehousing
  • Data Management
  • PySpark
  • Data Storage
  • Amazon S3
  • Data Storage Technologies
  • Data Infrastructure
  • Apache Spark
  • Cloud Security
  • Infrastructure Architecture
  • Cloud Computing Architecture
  • Cloud Deployment
  • Cloud Computing
  • Data Security
  • Data Lakes
  • Infrastructure as Code (IaC)
  • Cloud Storage
  • Transaction Processing

What you'll learn

  • Design star schema data models with fact and dimension tables that enable intuitive self-service business intelligence reporting.

  • Apply third normal form normalization to optimize database structure while maintaining query performance through indexing strategies.

  • Use advanced SQL window functions to calculate rolling metrics, rankings, and time-series analytics for complex data analysis.

  • Implement database replication and incremental loading techniques to ensure high availability and efficient data warehouse updates.
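The window-function patterns listed above (rolling metrics and rankings) look like this in SQL. The example uses SQLite through Python's standard library so it runs anywhere, but the same `OVER (...)` syntax applies in warehouse engines; the table and column names are invented for the illustration.

```python
import sqlite3

# Window-function sketch: a 3-day rolling average and a revenue ranking
# computed in one pass, without self-joins. Requires SQLite >= 3.25
# (bundled with modern Python).

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (day TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO daily_sales VALUES (?, ?)",
    [("2025-01-01", 100.0), ("2025-01-02", 200.0),
     ("2025-01-03", 300.0), ("2025-01-04", 400.0)],
)

rows = conn.execute("""
    SELECT day,
           amount,
           -- 3-day rolling average: current row plus the two before it
           AVG(amount) OVER (
               ORDER BY day ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
           ) AS rolling_avg_3d,
           -- rank days by revenue, highest first
           RANK() OVER (ORDER BY amount DESC) AS revenue_rank
    FROM daily_sales
    ORDER BY day
""").fetchall()
```

For 2025-01-04 the rolling average is (200 + 300 + 400) / 3 = 300 and the revenue rank is 1; the frame clause (`ROWS BETWEEN ... AND CURRENT ROW`) is what turns a plain aggregate into a rolling metric.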

Skills you'll gain

  • Business Intelligence
  • Relational Databases
  • Data Integration
  • Data Warehousing
  • Database Software
  • Data Modeling
  • Database Development
  • Data Pipelines
  • Performance Tuning
  • Data Quality
  • Database Architecture and Administration
  • Extract, Transform, Load
  • Star Schema
  • SQL
  • Database Design

DevOps and CI/CD for Data Engineering Performance

Course 4, 12 hours

What you'll learn

  • Resolve merge conflicts and trace bugs using Git history tools, keeping collaborative codebases stable and production-ready.

  • Design branching strategies and automate deployments with CI/CD pipelines to safely promote data pipeline artifacts across environments.

  • Build and publish versioned Docker images and automate server configuration with Ansible for consistent, reproducible environments.

  • Analyze query execution metrics and optimize resource allocation to maintain performance targets in production data systems.

Skills you'll gain

  • Application Deployment
  • Infrastructure as Code (IaC)
  • Version Control
  • CI/CD
  • Docker (Software)
  • Ansible
  • Continuous Deployment
  • Root Cause Analysis
  • Development Environment
  • Git (Version Control System)
  • DevOps
  • Data Infrastructure
  • Containerization
  • Data Pipelines
  • Configuration Management
  • Continuous Integration
  • Performance Tuning

Data Quality and Debugging for Reliable Pipelines

Course 5, 7 hours

What you'll learn

  • Define and automate data quality tests using YAML to validate row counts, null thresholds, and uniqueness across pipeline datasets.

  • Trace data anomalies through pipeline stages by analyzing logs and dashboards to identify and fix the exact source of failure.

  • Apply advanced Python debugging tools — including conditional breakpoints, watchpoints, and pdb — to diagnose and resolve pipeline issues.

  • Resolve complex concurrency bugs by reading stack traces and correlating thread logs to identify deadlocks and race conditions in code.
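Declarative data-quality tests of the kind described above can be reduced to a small engine that walks a spec and checks each rule. In practice the spec would live in YAML (as in dbt tests or Great Expectations suites); a plain dict stands in here so the sketch needs no third-party packages, and every rule name is invented for the example.

```python
# Declarative data-quality sketch: validate row counts, null thresholds,
# and uniqueness against a spec. In a real pipeline the spec would be
# parsed from a YAML file; rule names here are illustrative only.

spec = {
    "min_row_count": 3,
    "max_null_fraction": {"email": 0.25},
    "unique": ["order_id"],
}

def run_checks(rows, spec):
    """Return (check_name, passed) pairs for a list-of-dicts dataset."""
    results = []
    results.append(("min_row_count", len(rows) >= spec["min_row_count"]))
    for col, limit in spec["max_null_fraction"].items():
        nulls = sum(1 for r in rows if r.get(col) is None)
        results.append((f"null_fraction:{col}", nulls / len(rows) <= limit))
    for col in spec["unique"]:
        values = [r[col] for r in rows]
        results.append((f"unique:{col}", len(values) == len(set(values))))
    return results

rows = [
    {"order_id": 1, "email": "a@example.com"},
    {"order_id": 2, "email": None},
    {"order_id": 2, "email": "c@example.com"},  # duplicate order_id
]
results = dict(run_checks(rows, spec))
```

On this sample the row-count check passes while the null-fraction check (1 null in 3 rows exceeds the 25% threshold) and the uniqueness check both fail, which is exactly the signal a pipeline would use to halt a bad load before it reaches downstream tables.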

Skills you'll gain

  • Data Integrity
  • DevOps
  • Performance Tuning
  • Data Pipelines
  • YAML
  • Debugging
  • Reliability
  • Data Validation
  • Python Programming
  • Development Testing
  • Test Automation
  • Dashboard
  • Anomaly Detection
  • Root Cause Analysis
  • Generative AI
  • Data Quality

Career Development For Open Source Data Engineering

Course 6, 2 hours

What you'll learn

  • Build a data engineering portfolio with end-to-end pipeline projects that prove your ability to design, build, and deploy production-style systems.

  • Create a resume, LinkedIn profile, and GitHub presence that position you as a hands-on data engineer ready to contribute from day one.

  • Practice real data engineering interview scenarios and develop structured responses to technical, design, and behavioral questions.

  • Execute a 30-day career launch plan covering portfolio completion, job applications, and networking in the data engineering community.

Skills you'll gain

  • GitHub
  • Communication
  • Apache Spark
  • Python Programming
  • Apache
  • Professional Networking
  • Professional Development
  • Interviewing Skills
  • SQL
  • Software Development
  • Collaboration
  • Data Pipelines
  • Portfolio Management
  • Data Infrastructure
  • Data Quality
  • Apache Airflow

Earn a career certificate

Add this credential to your LinkedIn profile, resume, or CV. Share it on social media and in your performance review.

Instructor

Professionals from the Industry
376 Courses · 54,291 learners

Offered by

Coursera

Why people choose Coursera for their career

Felipe M.

Learner since 2018
"To be able to take courses at my own pace and rhythm has been an amazing experience. I can learn whenever it fits my schedule and mood."

Jennifer J.

Learner since 2020
"I directly applied the concepts and skills I learned from my courses to an exciting new project at work."

Larry W.

Learner since 2021
"When I need courses on topics that my university doesn't offer, Coursera is one of the best places to go."

Chaitanya A.

"Learning isn't just about being better at your job: it's so much more than that. Coursera allows me to learn without limits."
Coursera Plus

Open new doors with Coursera Plus

Unlimited access to 10,000+ world-class courses, hands-on projects, and job-ready certificate programs - all included in your subscription

Advance your career with an online degree

Earn a degree from world-class universities - 100% online

Join over 3,400 global companies that choose Coursera for Business

Upskill your employees to excel in the digital economy

Frequently asked questions

¹Based on Coursera learner outcome survey responses, United States, 2021.