kafka-connect-hdfs

How to ensure exactly-once delivery?

Open kimnami opened this issue 4 years ago • 1 comments

https://docs.confluent.io/kafka-connect-hdfs3-sink/current/overview.html#exactly-once-delivery

The connector uses a write-ahead log to ensure each record is written to HDFS exactly once. Also, the connector manages offsets by encoding the Kafka offset information into the HDFS file so that it can start from the last committed offsets in case of failures and task restarts.
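To make the "offset encoded into the HDFS file" part concrete, here is a minimal sketch of how a restart point could be derived from committed file names. It assumes the file-name layout `<topic>+<partition>+<startOffset>+<endOffset>.<ext>`, which is not stated in the text quoted above; the function name `next_offset_to_consume` and the sample file names are hypothetical, and this is not the connector's actual code.

```python
import re

# Assumed layout (not confirmed in this thread):
#   <topic>+<partition>+<startOffset>+<endOffset>.<ext>
COMMITTED_FILE = re.compile(
    r"^(?P<topic>.+)\+(?P<partition>\d+)\+(?P<start>\d+)\+(?P<end>\d+)\.\w+$"
)


def next_offset_to_consume(filenames, topic, partition):
    """Return one past the highest end offset found in committed file names.

    Returns 0 when no committed file exists for the topic partition.
    """
    last_end = -1
    for name in filenames:
        m = COMMITTED_FILE.match(name)
        if m and m["topic"] == topic and int(m["partition"]) == partition:
            last_end = max(last_end, int(m["end"]))
    return last_end + 1


# Hypothetical directory listing for illustration only.
files = [
    "logs+0+0000000000+0000000009.avro",
    "logs+0+0000000010+0000000019.avro",
    "logs+1+0000000000+0000000004.avro",
]
print(next_offset_to_consume(files, "logs", 0))
```

Under that assumption, the committed files themselves act as the offset store, so no separate offset record has to be kept in sync with the data.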

Those mechanisms cover failures and task restarts. I wonder how this connector ensures exactly-once delivery during normal operation.

Is HdfsSinkConnector idempotent and transactional? Where can I find that out?


My question is about how duplicates are avoided while records are being written to the temp file.

For example, let's assume that the last committed offset in the HDFS file is 10 and the flush size is 10. The connector would then consume offsets 11 to 20 before the next commit.

In this situation, while offsets 11 to 20 are being written to the temp file, how does it avoid duplicates? I think there is no offset information to read in the middle of writing the temp file, is there?
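To pin the scenario down, here is a toy model of one way it could play out. It rests on two assumptions that are not confirmed anywhere in this thread: the writer tracks the next expected offset in memory while the temp file is open, and an uncommitted temp file is discarded on restart so consumption rewinds to the last committed offset plus one. The class `ToyPartitionWriter` and its methods are hypothetical and are not the connector's API.

```python
class ToyPartitionWriter:
    """Toy model only, not the connector's code.

    Buffers offsets in a pretend temp file, commits every flush_size
    records, and throws the temp file away on restart.
    """

    def __init__(self, flush_size):
        self.flush_size = flush_size
        self.committed = []  # (start_offset, end_offset) per committed file
        self.temp = []       # offsets currently in the uncommitted temp file

    def next_committed_offset(self):
        # One past the end offset of the newest committed file, or 0 if none.
        return self.committed[-1][1] + 1 if self.committed else 0

    def write(self, offset):
        # Assumption: the expected offset is tracked in memory, so a record
        # that was already buffered or committed is skipped.
        expected = self.next_committed_offset() + len(self.temp)
        if offset < expected:
            return
        self.temp.append(offset)
        if len(self.temp) == self.flush_size:
            # "Commit": the temp file becomes a final file named by its offsets.
            self.committed.append((self.temp[0], self.temp[-1]))
            self.temp = []

    def restart(self):
        # Assumption: uncommitted temp data is dropped, not resumed.
        self.temp = []
        return self.next_committed_offset()


writer = ToyPartitionWriter(flush_size=10)
for offset in range(1, 11):       # offsets 1..10 are committed
    writer.write(offset)
for offset in range(11, 16):      # 11..15 reach the temp file, then a crash
    writer.write(offset)
resume_from = writer.restart()    # temp file dropped
for offset in range(resume_from, 21):
    writer.write(offset)          # 11..20 are rewritten and committed once
print(writer.committed)
```

In this model a partially written temp file never needs offset information of its own, because it is either completed and committed as a whole or discarded as a whole.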

kimnami, Jul 30 '21 18:07

You seem to be asking about the HDFS3 connector.

This repo is for the HDFS2 one.

OneCricketeer, Oct 20 '21 12:10