kafka-connect-storage-cloud icon indicating copy to clipboard operation
kafka-connect-storage-cloud copied to clipboard

Implement a pool for multipart uploads

Open danosipov opened this issue 5 years ago • 5 comments

While implementing Kafka Connect pipeline that writes to S3 at Datadog, we found the servers were not completely utilized, having idle CPU and more network bandwidth available. Profiling the process revealed ~90% of time is spent in upload to S3. It also looked that each chunk was uploaded serially - blocking reading while its being uploaded. We implemented this approach of a pool of workers utilizing S3 multipart uploads in parallel, and found the utilization of machines could be improved (and as a result were able to scale down the cluster size required to keep up with the Kafka topic).

danosipov avatar Mar 13 '19 15:03 danosipov

@confluentinc It looks like @danosipov just signed our Contributor License Agreement. :+1:

Always at your service,

clabot

hackrole avatar Mar 13 '19 15:03 hackrole

Is this project dead? This certainly seems like a PR that would be merged.

brokenjacobs avatar Oct 28 '19 20:10 brokenjacobs

Taking a look...

cjolivier01 avatar Dec 21 '21 18:12 cjolivier01

@danosipov would you be able to add some tests?

cjolivier01 avatar Dec 23 '21 00:12 cjolivier01

@danosipov would you be able to add some tests?

Thanks for taking a look @cjolivier01 ! It's been a while, but IIRC existing test suite would cover the code, as it changes the upload code path.

danosipov avatar Jan 04 '22 20:01 danosipov