Skip to main content

Streaming ETL: Kafka to Kafka

You can quickly build streaming ETL pipelines with Timeplus. For example, the original web access logs in Kafka topics contain the raw IP address. To further protect user privacy, you can build a data pipeline to read new data from Kafka, masking the IP address and send to a different Kafka topic.

Follow the guide for Timeplus Proton or Timeplus Cloud.

Timeplus Proton

You can follow the previous tutorial to setup the sample data and run the following SQL to build the pipeline.

-- read the topic via an external stream
CREATE EXTERNAL STREAM frontend_events(raw string)
SETTINGS type='kafka',
brokers='redpanda:9092',
topic='owlshop-frontend-events';

-- create the other external stream to write data to the other topic
CREATE EXTERNAL STREAM target(
_tp_time datetime64(3),
url string,
method string,
ip string)
SETTINGS type='kafka',
brokers='redpanda:9092',
topic='masked-fe-event',
data_format='JSONEachRow',
one_message_per_row=true;

-- setup the ETL pipeline via a materialized view
CREATE MATERIALIZED VIEW mv INTO target AS
SELECT now64() AS _tp_time,
raw:requestedUrl AS url,
raw:method AS method,
lower(hex(md5(raw:ipAddress))) AS ip
FROM frontend_events;

Timeplus Cloud

A blog is published with the detailed steps to read data from Kafka/Redpanda, apply the transformation and send to Kafka/Redpanda.

A few key steps:

  1. Connect to Redpanda:

add data

  1. Specify the Redpanda broker address and the authentication method.

broker

  1. Choose a topic and preview data.

preview

  1. Set the name for the external stream, say frontend_events.

set name

  1. Explore the live data in the stream/topic.

explore

  1. Write a streaming SQL to transform data.
SELECT response:statusCode as code,hex(md5(ipAddress)) as hashed_ip,method,requestedUrl 
FROM frontend_events WHERE response:statusCode!='200'

sql

  1. Send the result to another topic. Timeplus will create a new external stream as the destination and a materialized view as pipeline.

sink

The data lineage visualizes the relationships.

lineage

  1. New data is available in Kafka/Redpanda topic.

result