decaton icon indicating copy to clipboard operation
decaton copied to clipboard

Introduce a timeout for destroying processors

Open ocadaruma opened this issue 2 years ago • 0 comments

Summary

  • Currently, rebalance-listener could block indefinitely in below scenario:
    • Let's say:
      • There are 3 partitions (0,1,2)
      • 3 instances (X,Y,Z)
      • partition-0 is assigned to X
      • max.poll.interval.ms is set to long value because sometimes task takes long time to complete (Basically Decaton doesn't block poll() by task processing, but there's an exception in rebalance-listener)
    • Then, let's say partition-0's assignment is taken over by Y by reassignment triggered by Z's restart
    • rebalance-listener's onPartitionAssigned calls partition-0's PartitionProcessor#close, which waits processor unit's executor to terminate indefinitely
    • If long-running task is being executed by a processor, onPartitionAssigned will be blocked until the task completes
    • When Z came back to the group, it initiates another rebalance, but it never finishes because X is in stuck at onPartitionAssigned, so cannot send JoinGroup
    • Meanwhile, all processing on all instances are stuck
  • Though this behavior is Decaton's design decision (i.e. "rebalance could block up to max.poll.interval.ms at max"), some users use Decaton for processing long-running tasks, and aborting task "ungracefully" is preferable to waiting task indefinitely
  • So it might be better to have an option to set timeout on destroying processors

ocadaruma avatar Jul 15 '21 06:07 ocadaruma