kafka-connect-storage-cloud
Implement a pool for multipart uploads
While implementing a Kafka Connect pipeline that writes to S3 at Datadog, we found the servers were underutilized, with idle CPU and spare network bandwidth. Profiling the process revealed that ~90% of the time was spent uploading to S3. It also appeared that each chunk was uploaded serially, blocking reads while it was being uploaded. This PR implements a pool of workers that upload S3 multipart parts in parallel. With it, machine utilization improved, and as a result we were able to scale down the cluster size required to keep up with the Kafka topic.
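For readers skimming the thread, here is a minimal sketch of the general idea, not the code in this PR. It assumes a hypothetical `PartUploader` interface standing in for a single S3 UploadPart call, and the class and method names (`PooledMultipartUpload`, `submitPart`, `awaitAll`) are illustrative only. Parts are handed to a fixed-size thread pool so the caller can keep reading while earlier parts are still in flight; the ETags are collected in part order, which is what completing a multipart upload requires.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative sketch only; names are hypothetical and not taken from the PR.
final class PooledMultipartUpload {

    /** Hypothetical stand-in for one S3 UploadPart call; returns the part's ETag. */
    interface PartUploader {
        String uploadPart(int partNumber, byte[] data) throws Exception;
    }

    private final ExecutorService pool;
    private final PartUploader uploader;
    // Futures are kept in submission order, which is also part-number order.
    private final List<Future<String>> pending = new ArrayList<>();

    PooledMultipartUpload(int workers, PartUploader uploader) {
        this.pool = Executors.newFixedThreadPool(workers);
        this.uploader = uploader;
    }

    /**
     * Queues a part and returns immediately, so the caller is not blocked
     * while the part is uploaded. Assumes a single calling thread.
     */
    void submitPart(byte[] data) {
        final int partNumber = pending.size() + 1; // S3 part numbers start at 1
        final byte[] copy = data.clone();          // caller may reuse its buffer
        Callable<String> task = () -> uploader.uploadPart(partNumber, copy);
        pending.add(pool.submit(task));
    }

    /** Blocks until every part has finished; returns ETags in part order. */
    List<String> awaitAll() throws InterruptedException, ExecutionException {
        List<String> etags = new ArrayList<>();
        try {
            for (Future<String> f : pending) {
                etags.add(f.get()); // rethrows a failed upload as ExecutionException
            }
        } finally {
            pool.shutdown();
        }
        return etags;
    }
}
```

In a real connector, the uploader would wrap the S3 client's part-upload call, and a failure in `awaitAll` would be a cue to abort the multipart upload rather than complete it.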
@confluentinc It looks like @danosipov just signed our Contributor License Agreement. :+1:
Always at your service,
clabot
Is this project dead? This certainly seems like a PR that would be merged.
Taking a look...
@danosipov would you be able to add some tests?
Thanks for taking a look, @cjolivier01! It's been a while, but IIRC the existing test suite would cover the code, as it changes the upload code path.