stop rotating on partition change when rotate.interval.ms is set
Problem
Committing open files on partition change results in creating a lot of small files when records belonging to different partitioned are interleaved. We have a use case where we aggregate raw events into sessions spanning for 5-15 minutes with session time being the time of the first event in it. We use hourly partitioning and observe up to 10x increase in number of files per hour due to this.
The issue has been reported multiple time previously:
- https://github.com/confluentinc/kafka-connect-storage-cloud/issues/380
- https://github.com/confluentinc/kafka-connect-storage-cloud/issues/529
And even had an attempted fix:
- https://github.com/confluentinc/kafka-connect-storage-cloud/pull/574
Solution
Removing rotation on partition changes makes the semantics of rotate.interval.ms similar to flush.size.
It now defines constrains not for a single file, but for a "segment" of a stream:
records are accumulated in appropriate partitions until partition time advances at least rotate.interval.ms from the first time of the message in the "segment", at which point all files are flushed.
Testing
We have been running the patched version in our staging environment for more then a week now with constant consistency checks and have not seen any issues neither with number of files per hour nor with the consistency of the results.
Finally documentation for 'rotate.interval.ms' might need to be adjusted. Would appreciate any advice on how to do that.
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.
Jenkins build complains about missing 204-ccs artifacts, which I don't see in the dependency tree locally:
[ERROR] Failed to execute goal on project kafka-connect-s3: Could not resolve dependencies for project io.confluent:kafka-connect-s3:jar:10.6.0-SNAPSHOT: The following artifacts could not be resolved: org.apache.kafka:connect-runtime:jar:test:7.6.0-204-ccs, org.apache.kafka:kafka_2.12:jar:7.6.0-204-ccs, org.apache.kafka:kafka_2.12:jar:test:7.6.0-204-ccs: org.apache.kafka:connect-runtime:jar:test:7.6.0-204-ccs was not found in https://s3-us-west-2.amazonaws.com/confluent-snapshots/ during a previous attempt. This failure was cached in the local repository and resolution is not reattempted until the update interval of confluent-snapshots has elapsed or updates are forced -> [Help 1]
I would appreciate any guidance on how to fix this.