Understanding Watermark
In stream processing, a watermark is a marker that tracks how far event time has progressed in one or more streams. It signals that all events up to a specific timestamp have likely arrived, so the system can finalize time-based operations such as window aggregations.
Because stream processing deals with unbounded, out-of-order data, watermarks are essential for defining bounded windows and for discarding late or outdated data in a consistent way. Watermarks allow the system to advance event time, trigger computations (e.g., closing a tumbling window and emitting results), and maintain accuracy even when data arrives with delays.
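To make the idea concrete, here is a minimal Python sketch of one common watermark strategy: the watermark trails the largest event time seen so far by a fixed delay. The class name `WatermarkTracker` and its fields are hypothetical, chosen for illustration, and the sketch is not a description of any particular engine's internals.

```python
from dataclasses import dataclass


@dataclass
class WatermarkTracker:
    """Hypothetical helper: watermark = (max event time seen) - (fixed delay)."""
    delay: float                           # how long to wait for stragglers, in seconds
    max_event_time: float = float("-inf")  # largest event time observed so far

    @property
    def watermark(self) -> float:
        return self.max_event_time - self.delay

    def observe(self, event_time: float) -> bool:
        """Record an event; return True if it arrived behind the current watermark."""
        is_late = event_time < self.watermark
        # The watermark only moves forward: older events never pull it back.
        self.max_event_time = max(self.max_event_time, event_time)
        return is_late


tracker = WatermarkTracker(delay=2.0)
for t in [1.0, 4.0, 3.0, 7.0, 4.5]:
    late = tracker.observe(t)
    print(f"event_time={t} watermark={tracker.watermark} late={late}")
```

In this run the event at 3.0 is out of order but still ahead of the watermark (2.0), so it is accepted. The event at 4.5 arrives after the watermark has advanced to 5.0, so it is flagged as late.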
Try It Out
To better understand how watermarks work, use the following tool. It generates random data and simulates watermark progression for a tumbling window aggregation with a 5-second window size and a 2-second watermark delay.
Click the Play button to start the simulation, or click Next or Prev to step forward or backward one step at a time.
Watermark & Window Processing
SELECT window_start, count(), sum(v), avg(v)
FROM tumble(stream, 5s)
GROUP BY window_start
EMIT AFTER WATERMARK WITH DELAY 2s
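
The behavior of this query can be approximated offline with a short Python sketch. It assumes a window is closed once the watermark (largest event time seen minus the 2-second delay) reaches the window's end, and that events for an already closed window are dropped. Both rules are assumptions made for illustration, and the function name `tumble_aggregate` is hypothetical; the engine's exact semantics may differ.

```python
from collections import defaultdict


def tumble_aggregate(events, window_size=5, delay=2):
    """Yield (window_start, count, sum, avg) once the watermark passes a window's end.

    events: iterable of (event_time, v) pairs in arrival order.
    """
    open_windows = defaultdict(list)   # window_start -> values collected so far
    max_event_time = float("-inf")
    closed_before = float("-inf")      # windows ending at or before this are closed

    for event_time, v in events:
        window_start = (event_time // window_size) * window_size
        if window_start + window_size <= closed_before:
            continue                   # assumed policy: drop events for closed windows
        open_windows[window_start].append(v)

        max_event_time = max(max_event_time, event_time)
        watermark = max_event_time - delay

        # Close every open window whose end the watermark has reached.
        for start in sorted(open_windows):
            if start + window_size > watermark:
                break
            values = open_windows.pop(start)
            closed_before = max(closed_before, start + window_size)
            yield start, len(values), sum(values), sum(values) / len(values)


# (event_time, v) pairs in arrival order; note that 4 and 2 arrive out of order.
events = [(1, 10), (3, 20), (6, 5), (4, 30), (7, 15), (2, 99), (12, 1)]
results = list(tumble_aggregate(events))
print(results)  # [(0, 3, 60, 20.0), (5, 2, 20, 10.0)]
```

The event at time 4 arrives after the event at time 6 but is still counted, because the watermark is only 4 at that point and window [0, 5) is still open. The event at time 2 arrives after the watermark has reached 5, so it is dropped. Window [10, 15) never emits in this run because no later event pushes the watermark to 15.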