
Support initial delay on timer triggers after a partition is assigned

Open feli6 opened this issue 2 months ago • 0 comments

As mentioned in #732, kafka-flow does not ensure a single owner for a partition during specific rebalance scenarios caused by broker issues. This results in corrupted state snapshots.

In general, during broker maintenance or network issues, higher producer latencies can cause the handling of the current poll operation to continue even after the partition has been moved to another node by a rebalance. Most of the state corruption incidents happened when frequent rebalances caused lost partitions, rather than a graceful handover in which the partition is successfully unassigned from the current node and then assigned to another node.

The issue could be mitigated to a certain extent if we give the initial rebalance chaos, which usually causes a lot of resource consumption and contention, enough time to settle.

I suggest adding a configurable property for an initial delay on the timer triggers after a partition is assigned to a node. This would avoid flushing any state while there is a very high chance of another rebalance. Once the rebalance has reached a steady state, the app can start flushing the state and the offsets.
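
To make the proposal concrete, here is a minimal, language-agnostic sketch of the gating logic, written in Python for brevity. It is not kafka-flow's API: the class `InitialDelayGate`, its methods, and the `initial_delay_s` setting are all hypothetical names chosen for illustration. The idea is only that timer triggers for a partition are suppressed until a configured delay has elapsed since that partition was assigned.

```python
class InitialDelayGate:
    """Hypothetical helper: blocks timer triggers for a partition
    until `initial_delay_s` seconds have passed since its assignment."""

    def __init__(self, initial_delay_s: float) -> None:
        self.initial_delay_s = initial_delay_s
        # partition id -> timestamp (seconds) at which it was assigned
        self._assigned_at: dict[int, float] = {}

    def on_assigned(self, partition: int, now: float) -> None:
        # A (re)assignment restarts the delay window for this partition.
        self._assigned_at[partition] = now

    def on_revoked(self, partition: int) -> None:
        # Revoked or lost partitions must never fire timers on this node.
        self._assigned_at.pop(partition, None)

    def may_fire(self, partition: int, now: float) -> bool:
        assigned_at = self._assigned_at.get(partition)
        if assigned_at is None:
            return False  # not owned by this node
        return now - assigned_at >= self.initial_delay_s


# Usage: timestamps are passed in explicitly so the behavior is deterministic.
gate = InitialDelayGate(initial_delay_s=30.0)
gate.on_assigned(partition=0, now=100.0)
early = gate.may_fire(0, now=110.0)   # still inside the delay window
late = gate.may_fire(0, now=130.0)    # delay has elapsed
gate.on_revoked(0)
after_revoke = gate.may_fire(0, now=200.0)
```

In this sketch a repeated assignment resets the window, so a burst of rebalances keeps postponing the first flush until assignments stop changing. Whether the delay should also apply to offset commits, and how it interacts with existing timer configuration, is left open here.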

feli6 avatar Aug 28 '25 11:08 feli6