error cluster::tx_errc:16 on rolling previous tx.id

Open freef4ll opened this issue 3 years ago • 1 comments

Version & Environment

v22.2.2 (rev b6d1d46), Ubuntu 20.04.5

What went wrong?

Sep  7 16:11:36 redpanda-1 rpk[2574812]: TRACE 2022-09-07 16:11:36,594 [shard 0] tx - tx_gateway_frontend.cc:555 - processing name:init_tm_tx, tx_id:{librdkafka_transactions_example_16807}
Sep  7 16:11:36 redpanda-1 rpk[2574812]: TRACE 2022-09-07 16:11:36,594 [shard 0] tx - tx_gateway_frontend.cc:45 - got_lock name:init_tm_tx, tx_id:{librdkafka_transactions_example_16807}
Sep  7 16:11:36 redpanda-1 rpk[2574812]: TRACE 2022-09-07 16:11:36,594 [shard 0] tx - rm_group_frontend.cc:434 - dispatching name:commit_group_tx, group_id:{librdkafka_tx_group_773122}, pid:{producer_identity: id=1, epoch=9}, tx_seq:10797, from:0, to:3
Sep  7 16:11:36 redpanda-1 rpk[2574812]: TRACE 2022-09-07 16:11:36,594 [shard 0] tx - rm_partition_frontend.cc:559 - processing name:commit_tx, ntp:{kafka/transaction-output/1}, pid:{producer_identity: id=1, epoch=9}, tx_seq:10797
Sep  7 16:11:36 redpanda-1 rpk[2574812]: TRACE 2022-09-07 16:11:36,594 [shard 1] cluster - rm_stm.cc:573 - Can't find prepare for pid:{producer_identity: id=1, epoch=9} => replaying already comitted commit
Sep  7 16:11:36 redpanda-1 rpk[2574812]: TRACE 2022-09-07 16:11:36,594 [shard 0] tx - rm_partition_frontend.cc:567 - sending name:commit_tx, ntp:{kafka/transaction-output/1}, pid:{producer_identity: id=1, epoch=9}, tx_seq:10797, ec:cluster::tx_errc:0
Sep  7 16:11:36 redpanda-1 rpk[2574812]: TRACE 2022-09-07 16:11:36,595 [shard 0] tx - rm_group_frontend.cc:445 - received name:commit_group_tx, group_id:{librdkafka_tx_group_773122}, pid:{producer_identity: id=1, epoch=9}, tx_seq:10797, ec:cluster::tx_errc:0
Sep  7 16:11:36 redpanda-1 rpk[2574812]: WARN  2022-09-07 16:11:36,595 [shard 0] tx - tx_gateway_frontend.cc:770 - got error cluster::tx_errc:16 on rolling previous tx.id={librdkafka_transactions_example_16807} with status=2
Sep  7 16:11:36 redpanda-1 rpk[2574812]: TRACE 2022-09-07 16:11:36,595 [shard 0] tx - tx_gateway_frontend.cc:49 - released_lock name:init_tm_tx, tx_id:{librdkafka_transactions_example_16807}
Sep  7 16:11:36 redpanda-1 rpk[2574812]: TRACE 2022-09-07 16:11:36,595 [shard 0] tx - tx_gateway_frontend.cc:584 - sending name:init_tm_tx, tx_id:{librdkafka_transactions_example_16807}, pid:{producer_identity: id=-1, epoch=0}, ec: cluster::tx_errc:16
Sep  7 16:11:36 redpanda-1 rpk[2574812]: WARN  2022-09-07 16:11:36,595 [shard 0] kafka - init_producer_id.cc:70 - failed to allocate pid, ec: cluster::tx_errc:16

Application wasn't able to create transactions:

%7|Wed Sep  7 16:11:34 2022.279|GETPID|rdkafka#producer-1| [thrd:main]: TxnCoordinator/0: Failed to acquire PID: Broker: Broker not available
%7|Wed Sep  7 16:11:34 2022.279|IDEMPSTATE|rdkafka#producer-1| [thrd:main]: Idempotent producer state change WaitPID -> RequestPID
%4|Wed Sep  7 16:11:34 2022.279|GETPID|rdkafka#producer-1| [thrd:main]: Failed to acquire transactional PID from broker TxnCoordinator/0: Broker: Broker not available: retrying
%7|Wed Sep  7 16:11:34 2022.279|TXN|rdkafka#producer-1| [thrd:main]: Starting PID FSM timer: Failed to acquire transactional PID from broker TxnCoordinator/0: Broker: Broker not available
[...]
%7|Wed Sep  7 16:12:08 2022.542|RECV|rdkafka#producer-1| [thrd:TxnCoordinator]: TxnCoordinator/0: Received InitProducerIdResponse (v1, 16 bytes, CorrId 68, rtt 33.16ms)
%7|Wed Sep  7 16:12:08 2022.542|GETPID|rdkafka#producer-1| [thrd:main]: TxnCoordinator/0: Failed to acquire PID: Broker: Broker not available
%7|Wed Sep  7 16:12:08 2022.542|IDEMPSTATE|rdkafka#producer-1| [thrd:main]: Idempotent producer state change WaitPID -> RequestPID
%4|Wed Sep  7 16:12:08 2022.542|GETPID|rdkafka#producer-1| [thrd:main]: Failed to acquire transactional PID from broker TxnCoordinator/0: Broker: Broker not available: retrying
%7|Wed Sep  7 16:12:08 2022.542|TXN|rdkafka#producer-1| [thrd:main]: Starting PID FSM timer: Failed to acquire transactional PID from broker TxnCoordinator/0: Broker: Broker not available

How to reproduce the issue?

Local cluster of 5 nodes under virtualization. Cluster is empty but to enable rack_awareness, cluster is restarted quickly in a succession and ~4 seconds apart.

for i in $(seq 1 5); do echo "host $i: $(date)"; multipass exec redpanda-$i -- /usr/bin/sudo /usr/bin/systemctl restart redpanda; done
host 1: Wed Sep  7 14:53:12 UTC 2022
host 2: Wed Sep  7 14:53:16 UTC 2022
host 3: Wed Sep  7 14:53:20 UTC 2022
host 4: Wed Sep  7 14:53:24 UTC 2022
host 5: Wed Sep  7 14:53:28 UTC 2022

After cluster rpk cluster health and rpk cluster status show all nodes, start transaction generator as per https://github.com/redpanda-data/redpanda/issues/6314

Additional information

librdkafka logs: transactions_20220907_1611_broker_1_not_available.txt

trace level from the broker: syslog-20220907_1611_broker_1_not_available.txt.gz

https://github.com/redpanda-data/redpanda/issues/2580 is rather similar.

Sep 07 '22 15:09 freef4ll

this issue depends on https://github.com/redpanda-data/redpanda/issues/6978

Nov 04 '22 17:11 rystsov