Introduction
Change Data Capture (CDC) enables real-time replication of changes from operational databases to data warehouses like Snowflake. By capturing and propagating data changes, CDC ensures that the target system remains synchronized with the source, supporting analytics, reporting, and machine learning workloads.
This guide focuses on setting up CDC for PostgreSQL, MySQL, and MongoDB using Kafka connectors. While the underlying setup process is similar, each database has unique mechanisms for logging and capturing changes.
What is Change Data Capture (CDC)?
CDC tracks and captures database changes—such as inserts, updates, and deletes—and streams them to a target system like Snowflake. This process is critical for building real-time, scalable data pipelines. Tools like Kafka handle the streaming, while AWS Secrets Manager secures sensitive credentials during the process.
General Steps to Configure CDC
Regardless of the source database, setting up CDC involves the following steps:
Configure Kafka Connector YAML:
Define the source database configuration.
Specify Snowflake as the destination.
Use database-specific options (details below).
Secure Credentials with AWS Secrets Manager:
Store sensitive information, such as usernames, passwords, and private keys, securely in AWS Secrets Manager.
Reference these credentials in the YAML file using ${secretsManager:<path>} (a boto3 sketch for creating and spot-checking such a secret follows these steps).
Validate Source Database Settings:
Enable the replication features required for CDC. These features vary by database (prerequisite checks accompany the YAML examples later in this guide):
PostgreSQL: Logical replication via publications and replication slots.
MySQL: Binary log (binlog) events, typically in row format.
MongoDB: The oplog, which is only maintained in replica sets.
Deploy and Test:
Deploy the Kafka connector and verify that changes from the source database are captured and successfully propagated to Snowflake.
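To make step 2 concrete, the sketch below creates a secret with boto3 and reads it back. The secret name, region, and payload are placeholders, and the connector itself resolves ${secretsManager:<path>} references at runtime; this script only helps you create and spot-check the secrets beforehand.

# Minimal sketch: store and spot-check a CDC credential in AWS Secrets Manager.
# Secret name, region, and value are placeholders for this example.
import json
import boto3

secrets = boto3.client("secretsmanager", region_name="us-east-1")

# Create the secret the connector YAML will reference
# (use put_secret_value instead to update an existing secret).
secrets.create_secret(
    Name="cdc/postgres/password",
    SecretString=json.dumps({"password": "change-me"}),
)

# Confirm the secret resolves before wiring it into the connector.
value = secrets.get_secret_value(SecretId="cdc/postgres/password")
print(list(json.loads(value["SecretString"]).keys()))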
What’s Happening During the Setup?
At a high level, the Kafka connector acts as the middleware between your source database and Snowflake. Here's what happens during the CDC setup and operation:
Change Logging in the Source Database:
Each database uses a different mechanism to log changes:
PostgreSQL: Logical replication exposes changes to the tables listed in a publication; a replication slot retains those changes until the connector consumes them.
MySQL: The binary log (binlog) records committed, data-modifying transactions and serves as the source of truth for changes.
MongoDB: The oplog, a special capped collection maintained by replica sets, records every write operation.
Connector Monitoring:
The Kafka connector monitors these logs (replication slot, binlog, or oplog) and streams the changes into Kafka topics (a consumer sketch at the end of this section shows how to inspect them).
Data Transformation:
The change events are serialized and, where necessary, transformed into records that the destination system can ingest.
Streaming to Snowflake:
The processed data is streamed to Snowflake, where it is ingested into the target database and schema.
This setup ensures that every change in the source database is captured and mirrored in Snowflake in near real-time.
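If changes are not showing up in Snowflake, a quick way to isolate the problem is to look at the Kafka topics directly. The sketch below assumes the kafka-python package, a local broker address, and a hypothetical topic name; it simply prints a few raw change events so you can confirm the connector is producing them.

# Minimal sketch: peek at raw change events on a CDC topic.
# Broker address and topic name are assumptions -- substitute your own.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "<cdc_topic_name>",                # e.g. a per-table topic created by the connector
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",      # start from the beginning of the topic
    consumer_timeout_ms=10_000,        # stop iterating after 10s with no new messages
)

for record in consumer:
    # Each value is the serialized change event (commonly JSON bytes).
    print(record.topic, record.offset, record.value[:200])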
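At the other end of the pipeline, you can verify ingestion by querying the target table. The sketch below uses the snowflake-connector-python package with key-pair authentication; the account, user, key path, and table name are placeholders.

# Minimal sketch: confirm CDC rows are arriving in Snowflake.
# Account, user, database, schema, table, and key path are placeholders.
import snowflake.connector
from cryptography.hazmat.primitives import serialization

# Load the private key used for key-pair authentication and convert it
# to the DER bytes the connector expects.
with open("/path/to/rsa_key.p8", "rb") as f:
    private_key = serialization.load_pem_private_key(f.read(), password=None)
key_bytes = private_key.private_bytes(
    encoding=serialization.Encoding.DER,
    format=serialization.PrivateFormat.PKCS8,
    encryption_algorithm=serialization.NoEncryption(),
)

conn = snowflake.connector.connect(
    account="<account_identifier>",
    user="<username>",
    private_key=key_bytes,
    database="<database_name>",
    schema="<schema_name>",
)
cur = conn.cursor()
cur.execute("SELECT COUNT(*) FROM <target_table>")
print("rows in target table:", cur.fetchone()[0])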
YAML Configuration Examples
PostgreSQL:
source:
  type: postgres
  hostname: <hostname>
  port: 5432
  user: ${secretsManager:<path-to-username>}
  password: ${secretsManager:<path-to-password>}
  database: <database_name>
  publicationName: <publication_name>
  slotName: <slot_name>

destination:
  type: snowflake
  urlName: <snowflake_url>
  databaseName: <database_name>
  schemaName: <schema_name>
  userName: <username>
  privateKey: ${secretsManager:<path-to-private-key>}
  privateKeyPassphrase: ${secretsManager:<path-to-passphrase>}
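The publicationName and slotName referenced above must already exist on the source database. A minimal sketch of creating them with psycopg2 follows; it assumes wal_level is already set to logical and uses the pgoutput output plugin, which may differ from the plugin your connector expects.

# Minimal sketch: create the publication and logical replication slot that
# publicationName and slotName point to. Assumes wal_level = 'logical' is
# already configured; pgoutput is an assumed output plugin.
import psycopg2

conn = psycopg2.connect(
    host="<hostname>", port=5432, dbname="<database_name>",
    user="<replication_user>", password="<password>",
)
conn.autocommit = True
cur = conn.cursor()

# Publication: defines which tables are exposed to logical replication.
# (FOR ALL TABLES requires superuser; you can list specific tables instead.)
cur.execute("CREATE PUBLICATION <publication_name> FOR ALL TABLES")

# Replication slot: makes the server retain WAL until the connector reads it.
cur.execute(
    "SELECT pg_create_logical_replication_slot(%s, %s)",
    ("<slot_name>", "pgoutput"),
)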
MySQL:
source:
  type: mysql
  hostname: <hostname>
  port: 3306
  user: ${secretsManager:<path-to-username>}
  password: ${secretsManager:<path-to-password>}
  serverId: "<unique-server-id>"

destination:
  type: snowflake
  urlName: <snowflake_url>
  databaseName: <database_name>
  schemaName: <schema_name>
  userName: <username>
  privateKey: ${secretsManager:<path-to-private-key>}
  privateKeyPassphrase: ${secretsManager:<path-to-passphrase>}
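Before starting the MySQL connector, it is worth confirming that the binlog is enabled in row format and that the serverId you chose does not collide with the server's own server_id. A rough check using the mysql-connector-python package (placeholder credentials) might look like this:

# Minimal sketch: verify MySQL binlog prerequisites for CDC.
# Hostname and credentials are placeholders.
import mysql.connector

conn = mysql.connector.connect(
    host="<hostname>", port=3306,
    user="<username>", password="<password>",
)
cur = conn.cursor()

for variable in ("log_bin", "binlog_format", "binlog_row_image", "server_id"):
    cur.execute("SHOW VARIABLES LIKE %s", (variable,))
    print(cur.fetchone())

# CDC typically expects log_bin = ON and binlog_format = ROW;
# the connector's serverId must differ from the reported server_id.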
MongoDB:
source:
  type: mongodb
  hosts: <hostname>
  user: ${secretsManager:<path-to-username>}
  password: ${secretsManager:<path-to-password>}
  capturemode: oplog

destination:
  type: snowflake
  urlName: <snowflake_url>
  databaseName: <database_name>
  schemaName: <schema_name>
  userName: <username>
  privateKey: ${secretsManager:<path-to-private-key>}
  privateKeyPassphrase: ${secretsManager:<path-to-passphrase>}
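Because MongoDB only maintains an oplog when it runs as a replica set, a quick pymongo check (placeholder connection string) can confirm the prerequisite before the connector is deployed:

# Minimal sketch: confirm the MongoDB deployment is a replica set with an oplog.
# The connection string and replica set name are placeholders.
from pymongo import MongoClient

client = MongoClient("mongodb://<username>:<password>@<hostname>:27017/?replicaSet=<rs_name>")

# Raises an error if the server is not running as part of a replica set.
status = client.admin.command("replSetGetStatus")
print("replica set:", status["set"], "state:", status["myState"])

# Peek at the most recent oplog entry to confirm writes are being logged
# (reading local.oplog.rs requires sufficient privileges).
latest = client.local["oplog.rs"].find_one(sort=[("$natural", -1)])
print("latest oplog op:", latest["op"], "at", latest["ts"])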
Conclusion
Setting up CDC for PostgreSQL, MySQL, and MongoDB follows a consistent process: configuring the source and destination in a Kafka connector YAML file, securing credentials, and ensuring that the source database's replication features are correctly enabled.
While the databases differ in how they log changes—logical replication for PostgreSQL, binlog for MySQL, and oplog for MongoDB—the underlying concept of streaming changes to Snowflake remains the same. By understanding these mechanisms, you can build robust and secure real-time data pipelines to keep Snowflake synchronized with your operational databases.
Stay tuned for advanced topics in Part II, including incremental snapshots and best practices for optimizing CDC performance.