kafka-connect-cosmosdb icon indicating copy to clipboard operation
kafka-connect-cosmosdb copied to clipboard

Setting lease container token with Kafka offsets does not handle multiple tasks

Open sivamu opened this issue 4 years ago • 2 comments

Description

Suppose we have a source connector with multiple tasks. Each task will try to reset the continuation token on the lease container (if using useLatestOffset=False) and there will be duplicate entries as each task will see it's own changes. The same applies when using useLatestOffset=True, since each task will attempt to rewind the changes and read them again.

Expected Behavior

Multiple tasks should be able to rewind the continuation token only once so that there won't be any repeat processing/duplicates.

Steps To Reproduce:

  • Configure Source Connector with single task
  • Insert some documents in cosmosdb
  • Reconfigure Source Connector with 2 tasks
  • Insert more documents in cosmosdb
  • Observe message feed in kafka

sivamu avatar Feb 10 '21 20:02 sivamu

Looking at how the other DB connectors have approached the source offset issue. Interestingly, most of them do not support multiple source tasks at all.

MongoDB: https://jira.mongodb.org/browse/KAFKA-121 Debezium: https://github.com/debezium/debezium/blob/master/debezium-embedded/src/main/java/io/debezium/connector/simple/SimpleSourceConnector.java#L91 JDBC: https://github.com/confluentinc/kafka-connect-jdbc/blob/master/src/main/java/io/confluent/connect/jdbc/JdbcSourceConnector.java#L137

Not advocating that our connector goes with the same approach but rather, we might need to ignore handling offsets between multiple tasks, especially when monitoring the same Cosmos Container.

sivamu avatar Feb 11 '21 16:02 sivamu

  • New issue: document in connector that it supports at least once with multiple tasks and exactly once for single task (#334)

brandynbrown avatar Feb 18 '21 20:02 brandynbrown