
[feature request] flush instead of crashing in case of memory shortage

Open mdespriee opened this issue 3 years ago • 6 comments

Hello,

We archive the data going through certain topics to s3; we configured a couple of s3 sink connectors to do this.

To keep this data usable, we try to avoid fragmenting it into numerous small files on s3. So we configured the time partitioning, flush size, and rotate interval to produce larger files. But now our connect cluster is unstable and fails randomly with OOMs, because too much data sits in memory. And now we're fiddling with s3.part.size, following this SO post.

Worse, the situation is a kind of dead end: after an OOM, if we restart the connector (or the whole connect cluster), the same thing happens again (the data doesn't fit in memory any better), and the cluster won't resume normal operation. So we're forced to reconfigure the memory settings or the flush size to resume operation.

This is not satisfactory at all; I'd prefer a rock-solid connect cluster I can trust, without being too dependent on any hard limit.

What about automatically flushing the data to s3 when memory becomes scarce? With a couple of error/warning logs, of course. At least the connector wouldn't crash/stop randomly and require an on-call engineer to fiddle with settings to get it running again.

Or maybe I'm missing the point, or a config setting. In that case, I'd be happy to learn ;-)

mdespriee avatar Oct 07 '21 13:10 mdespriee
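
For reference, the settings mentioned in the issue map to connector properties like these (an illustrative sketch only; the topic, bucket, durations, and sizes are placeholder values, not taken from this thread):

```json
{
  "connector.class": "io.confluent.connect.s3.S3SinkConnector",
  "topics": "my-topic",
  "s3.bucket.name": "my-archive-bucket",
  "partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
  "partition.duration.ms": "3600000",
  "flush.size": "100000",
  "rotate.interval.ms": "3600000",
  "s3.part.size": "26214400"
}
```

The tension described above comes from the fact that larger files (bigger `flush.size`, longer rotate intervals) mean more records buffered in the worker's heap before each upload.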

For example, s3.part.size = 5MB means that a task uploads at most 5MB of buffered bytes to s3 at a time.

Here are some ideas; I don't know if they can help you:

- Reducing this param reduces the memory allocated for each s3 output stream.
- Increasing the number of machines in the connect cluster makes each machine take on fewer tasks (and therefore fewer s3 output streams).

tjgykhulj avatar Oct 26 '21 04:10 tjgykhulj
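
The advice above can be sketched as a rough heap estimate (an assumption based on this comment, not documented behavior: each open s3 output stream buffers up to `s3.part.size` bytes; the function and example numbers are hypothetical):

```python
MB = 1024 * 1024

def estimate_buffer_bytes(tasks_per_worker, open_streams_per_task, s3_part_size):
    """Upper bound on upload-buffer memory one Connect worker may allocate,
    assuming one s3.part.size buffer per open output stream."""
    return tasks_per_worker * open_streams_per_task * s3_part_size

# e.g. 8 tasks on a worker, each with 20 open output streams, 25 MB parts:
print(estimate_buffer_bytes(8, 20, 25 * MB) // MB)  # → 4000 (MB)
```

This shows why both levers in the comment work: fewer tasks per machine or a smaller part size both shrink the product, at the cost of either more machines or more multipart requests.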

No, do not reduce s3.part.size!

Use this -> https://github.com/confluentinc/kafka-connect-storage-cloud/pull/320

raphaelauv avatar Oct 29 '21 05:10 raphaelauv

@raphaelauv well, it sounds promising, but these PRs seem stalled... Any idea when they could land?

mdespriee avatar Nov 05 '21 10:11 mdespriee

Confluent doesn't review contributions, so very probably never.

raphaelauv avatar Nov 05 '21 11:11 raphaelauv

> No, do not reduce s3.part.size!
>
> Use this -> #320

Hi, I wonder why we shouldn't reduce s3.part.size?

tjgykhulj avatar Nov 17 '21 09:11 tjgykhulj

@tjgykhulj

It is generally considered that the optimal s3 file size for analytics needs is around 200 MB.

Check the internet for more details.

raphaelauv avatar Apr 06 '22 08:04 raphaelauv
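
One additional reason s3.part.size has a floor (context from AWS S3 multipart-upload limits, not stated in the thread): non-final parts must be at least 5 MiB, and an object can have at most 10,000 parts, so the part size also caps the largest file the connector can write in one object:

```python
MIB = 1024 * 1024
MIN_PART_SIZE = 5 * MIB   # AWS minimum size for non-final multipart parts
MAX_PARTS = 10_000        # AWS cap on parts per object

def max_object_size(part_size):
    """Largest S3 object reachable with a given s3.part.size,
    assuming one multipart upload per output file."""
    if part_size < MIN_PART_SIZE:
        raise ValueError("part size below the 5 MiB multipart minimum")
    return part_size * MAX_PARTS

# With the 5 MiB minimum, files are still capped well above 200 MB:
print(max_object_size(5 * MIB) // (1024 ** 3))  # → 48 (GiB)
```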