kafka-connect-cosmosdb
kafka-connect-cosmosdb copied to clipboard
Setting lease container token with Kafka offsets does not handle multiple tasks
Description
Suppose we have a source connector with multiple tasks. Each task will try to reset the continuation token on the lease container (if using useLatestOffset=False) and there will be duplicate entries as each task will see it's own changes. The same applies when using useLatestOffset=True, since each task will attempt to rewind the changes and read them again.
Expected Behavior
Multiple tasks should be able to rewind the continuation token only once so that there won't be any repeat processing/duplicates.
Steps To Reproduce:
- Configure Source Connector with single task
- Insert some documents in cosmosdb
- Reconfigure Source Connector with 2 tasks
- Insert more documents in cosmosdb
- Observe message feed in kafka
Looking at how the other DB connectors have approached the source offset issue. Interestingly, most of them do not support multiple source tasks at all.
MongoDB: https://jira.mongodb.org/browse/KAFKA-121 Debezium: https://github.com/debezium/debezium/blob/master/debezium-embedded/src/main/java/io/debezium/connector/simple/SimpleSourceConnector.java#L91 JDBC: https://github.com/confluentinc/kafka-connect-jdbc/blob/master/src/main/java/io/confluent/connect/jdbc/JdbcSourceConnector.java#L137
Not advocating that our connector goes with the same approach but rather, we might need to ignore handling offsets between multiple tasks, especially when monitoring the same Cosmos Container.
- New issue: document in connector that it supports at least once with multiple tasks and exactly once for single task (#334)