erlkaf icon indicating copy to clipboard operation
erlkaf copied to clipboard

Potential consumer queue cleanup issue?

Open hikui opened this issue 3 years ago • 1 comments
trafficstars

erlkaf version: 2.0.8

Hi Silviu,

Thanks for your work.

We are currently using erlkaf in our server. However, I noticed sometimes when partition rebalance is triggered, some partition consumers may get stuck. This happens when a partition consumer is doing heavy work that takes long time.

After dig deeper in erlkaf, I found that when :revoke_partition is received by the consumer group, it tries to stop all partition consumers. It does so by sending a stop message to each partition consumer and wait for 5s. Now that we have a heavy consumer that is doing a task longer than 5s, it will then be force killed. When this happens, consumer_queue_cleanup is not run. I suspect it may be related to partition stuck issue.

Not sure if my understanding is correct.

hikui avatar Sep 12 '22 14:09 hikui

Your understanding on the working flow it's correct. So far I didn't encounter this problem even if I process over 20 M messages a day on the platform where we are using erlkaf. But it's true that all our events are designed in a way that doesn't take more than few seconds.

I don't have too much time in this moment to look into this but calling of consumer_queue_cleanup it's taking also place when the erlang is releasing the state of the killed process. And even if it doesn't leak this might get called after the re-balancing is taking place and might lead to this problem.

I'm wonder if we change the code to kill the process with normal reason will fix your problem.

silviucpp avatar Sep 12 '22 15:09 silviucpp

Merged in master.

silviucpp avatar Oct 23 '22 12:10 silviucpp