kafka-connect-storage-cloud Implement a pool for multipart uploads

Open danosipov opened this issue 5 years ago • 5 comments

While implementing a Kafka Connect pipeline that writes to S3 at Datadog, we found that the servers were not fully utilized: CPU sat idle and network bandwidth was left unused. Profiling the process revealed that ~90% of the time was spent uploading to S3. It also appeared that each chunk was uploaded serially, blocking reads while it was being uploaded. We implemented this approach, a pool of workers that issues S3 multipart uploads in parallel, and found that machine utilization improved (and as a result we were able to scale down the cluster size required to keep up with the Kafka topic).
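To illustrate the general idea (not the actual patch), here is a minimal sketch of uploading parts concurrently on a bounded worker pool. The `PartUploader` interface, the class name, and `uploadInParallel` are hypothetical names invented for this example; a real implementation would call the S3 client's upload-part operation where the fake uploader is used here, and would finish with a complete-multipart-upload call that passes the collected ETags.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelMultipartSketch {

    /** Hypothetical stand-in for an S3 upload-part call; returns the part's ETag. */
    public interface PartUploader {
        String uploadPart(int partNumber, byte[] data) throws Exception;
    }

    /**
     * Splits the payload into fixed-size parts, submits each part to a bounded
     * thread pool, and returns the ETags in part order.
     */
    public static List<String> uploadInParallel(byte[] payload, int partSize,
                                                int workers, PartUploader uploader)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<String>> pending = new ArrayList<>();
            int partNumber = 1; // S3 part numbers start at 1
            for (int offset = 0; offset < payload.length; offset += partSize) {
                int end = Math.min(offset + partSize, payload.length);
                byte[] part = Arrays.copyOfRange(payload, offset, end);
                final int n = partNumber++;
                // Explicit Callable so the task may throw checked exceptions.
                Callable<String> task = () -> uploader.uploadPart(n, part);
                pending.add(pool.submit(task));
            }
            // Futures are collected in submission order, so ETags stay ordered
            // by part number even though uploads finish in any order.
            List<String> etags = new ArrayList<>();
            for (Future<String> f : pending) {
                etags.add(f.get()); // rethrows an upload failure as ExecutionException
            }
            return etags;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Fake uploader that simulates network latency. Sizes are tiny for
        // demonstration; real S3 parts must be at least 5 MiB except the last.
        PartUploader fake = (n, data) -> {
            Thread.sleep(50);
            return "etag-" + n + "-" + data.length;
        };
        System.out.println(uploadInParallel(new byte[25], 10, 3, fake));
        // prints [etag-1-10, etag-2-10, etag-3-5]
    }
}
```

Because all parts are submitted without waiting for the previous upload to finish, the producer is no longer blocked on each upload. A production version would also need to bound the number of buffered parts to limit memory use and abort the multipart upload on failure.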