
[QUESTION] Node Failing During a Transaction (Consume-Process-Produce)

Open jameskirch opened this issue 1 year ago • 0 comments

We have followed the transactional consume-process-produce paradigm laid out at:

https://aiokafka.readthedocs.io/en/stable/examples/transaction_example.html
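Our loop is essentially the pattern from that page. For reference, here is a minimal sketch of what we are running (topic names, group id, transactional id, and bootstrap servers are placeholders, and the actual processing step is elided):

```python
import asyncio

from aiokafka import AIOKafkaConsumer, AIOKafkaProducer

# Placeholder names -- substitute real topics, group id, and brokers.
IN_TOPIC, OUT_TOPIC, GROUP_ID = "in-topic", "out-topic", "my-group"
BOOTSTRAP = "localhost:9092"


async def consume_process_produce():
    consumer = AIOKafkaConsumer(
        IN_TOPIC,
        bootstrap_servers=BOOTSTRAP,
        group_id=GROUP_ID,
        enable_auto_commit=False,          # offsets are committed via the transaction
        isolation_level="read_committed",  # only read committed records
    )
    producer = AIOKafkaProducer(
        bootstrap_servers=BOOTSTRAP,
        transactional_id="my-transactional-id",
    )
    await consumer.start()
    await producer.start()
    try:
        while True:
            batch = await consumer.getmany(timeout_ms=1000)
            if not batch:
                continue
            # Everything inside this block is committed (or aborted) atomically.
            async with producer.transaction():
                offsets = {}
                for tp, msgs in batch.items():
                    for msg in msgs:
                        # real "process" step elided; we just forward the value
                        await producer.send(OUT_TOPIC, msg.value)
                    offsets[tp] = msgs[-1].offset + 1
                # This is the call that spins on the unreachable node in our case.
                await producer.send_offsets_to_transaction(offsets, GROUP_ID)
    finally:
        await consumer.stop()
        await producer.stop()


asyncio.run(consume_process_produce())
```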

We are seeing issues when a node is restarted (or becomes unreachable) while a transaction is mid-processing. The main scenario is high-traffic producers during Amazon's rolling restarts of their servers for maintenance, which inevitably leaves a node unreachable while a producer is mid-transaction.

Instead of failing over to another reachable node, 'send_offsets_to_transaction' keeps retrying the problem node until timeouts are eventually hit and everything crashes (strangely, the same errors continue to flood the logs even after the node recovers):

Unable connect to node with id 2: [Errno 111] Connect call failed ('<ip>', <port>)
Could not send <class 'aiokafka.protocol.transaction.TxnOffsetCommitRequest_v0'>: NodeNotReadyError('Attempt to send a request to node which is not ready (node id 2).')
Unable connect to node with id 2: [Errno 111] Connect call failed ('<ip>', <port>)
Could not send <class 'aiokafka.protocol.transaction.TxnOffsetCommitRequest_v0'>: NodeNotReadyError('Attempt to send a request to node which is not ready (node id 2).')

...(above repeats 100s of times until an eventual timeout)

What is the intended behavior when a node becomes unreachable in the middle of a producer transaction? Is it inevitable that the transaction will fail?

Is it possible to catch NodeNotReadyError so we can abort the transaction and start a new one, rather than having it get stuck in a loop and fail?
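For context, the handling we would ideally like is something along these lines. This is only a sketch and assumes the errors (NodeNotReadyError / KafkaTimeoutError from aiokafka.errors) actually propagate out of the transaction block instead of being retried internally, which is precisely what does not seem to happen today:

```python
import asyncio

from aiokafka.errors import KafkaTimeoutError, NodeNotReadyError


# Hypothetical retry wrapper: if the offset commit cannot reach the
# coordinator, abort the current transaction and retry the batch instead
# of looping on the dead node. Assumes the exception reaches this frame.
async def commit_with_retry(producer, offsets, group_id, retries=3):
    for attempt in range(retries):
        try:
            async with producer.transaction():
                # ... re-send the processed messages here ...
                await producer.send_offsets_to_transaction(offsets, group_id)
            return  # committed successfully
        except (NodeNotReadyError, KafkaTimeoutError):
            # Leaving the `async with` block on an exception aborts the
            # transaction; back off, let metadata refresh, and try again.
            await asyncio.sleep(2 ** attempt)
    raise RuntimeError(f"could not commit transaction after {retries} attempts")
```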

aiokafka.version == '0.8.0' kafka.version == '2.0.2'

jameskirch · Jul 10 '23 23:07