redpanda icon indicating copy to clipboard operation
redpanda copied to clipboard

Transaction previous error (REBALANCE_IN_PROGRESS) already raised

Open freef4ll opened this issue 3 years ago • 0 comments

Version & Environment

Redpanda version: v0.0.0-20221103gita5c4127 (rev a5c4127)

What went wrong?

A small stress test to simulate some clients crashing.

Client 1(10.142.65.19) is constantly running. Client 2(10.142.65.72) and Client 3(10.142.65.207) get started and get killed after 30 to 39 seconds in a constant loop.

  1. Client 1: 09:02:30 transaction is aborted since group rebalance is in progress
  2. Client 2: 09:02:11 Received FindCoordinatorResponse, and rebalance happening for 32 seconds
  3. Client 3: 09:02:30 TCP session closed from previous instance, but at 09:02:26 a FindCoordinatorResponse is received.

The broker 10.142.65.207 rpk logs:

Nov  3 09:01:00 sven-tests3 rpk[3433663]: INFO  2022-11-03 09:01:00,526 [shard 0] kafka - connection_context.cc:391 - Disconnected 10.142.65.72:40974 (stream closed)
Nov  3 09:01:45 sven-tests3 rpk[3433663]: INFO  2022-11-03 09:01:45,534 [shard 4] kafka - connection_context.cc:391 - Disconnected 10.142.65.72:41332 (Connection reset by peer)
Nov  3 09:02:30 sven-tests3 rpk[3433663]: INFO  2022-11-03 09:02:30,527 [shard 1] kafka - connection_context.cc:391 - Disconnected 10.142.65.207:39522 (stream closed)

Client 1:

%7|Thu Nov  3 09:02:30 2022.531|SEND|rdkafka#producer-1| [thrd:TxnCoordinator]: TxnCoordinator/1: Sent AddPartitionsToTxnRequest (v0, 70 bytes @ 0, CorrId 1362603)
%7|Thu Nov  3 09:02:30 2022.531|RECV|rdkafka#producer-1| [thrd:TxnCoordinator]: TxnCoordinator/1: Received AddPartitionsToTxnResponse (v0, 30 bytes, CorrId 1362603, rtt 0.48ms)
%7|Thu Nov  3 09:02:30 2022.531|ADDPARTS|rdkafka#producer-1| [thrd:main]: txn_output [5] registered with transaction
%7|Thu Nov  3 09:02:30 2022.531|WAKEUP|rdkafka#producer-1| [thrd:main]: 10.142.65.207:9092/3: Wake-up: partitions added to transaction
%7|Thu Nov  3 09:02:30 2022.531|WAKEUP|rdkafka#producer-1| [thrd:main]: 10.142.65.72:9092/2: Wake-up: partitions added to transaction
%7|Thu Nov  3 09:02:30 2022.531|WAKEUP|rdkafka#producer-1| [thrd:main]: 10.142.65.19:9092/1: Wake-up: partitions added to transaction
%7|Thu Nov  3 09:02:30 2022.531|WAKEUP|rdkafka#producer-1| [thrd:main]: TxnCoordinator/1: Wake-up: partitions added to transaction
%7|Thu Nov  3 09:02:30 2022.531|WAKEUP|rdkafka#producer-1| [thrd:main]: Wake-up sent to 4 broker threads in state >= INIT: partitions added to transaction
%7|Thu Nov  3 09:02:30 2022.531|TOPPAR|rdkafka#producer-1| [thrd:10.142.65.72:9092/2]: 10.142.65.72:9092/2: txn_output [5] 1 message(s) in xmit queue (1 added from partition queue)
%7|Thu Nov  3 09:02:30 2022.531|RECV|rdkafka#producer-1| [thrd:10.142.65.207:9092/3]: 10.142.65.207:9092/3: Received TxnOffsetCommitResponse (v3, 26 bytes, CorrId 603566, rtt 0.99ms)
%3|Thu Nov  3 09:02:30 2022.531|TXNERR|rdkafka#producer-1| [thrd:main]: Current transaction failed in state InTransaction: Failed to commit offsets to transaction on broker 10.142.65.207:9092/3: txn_input[1](Broker: Group rebalance in progress) (after 0ms) (REBALANCE_IN_PROGRESS)
%7|Thu Nov  3 09:02:30 2022.531|TXNSTATE|rdkafka#producer-1| [thrd:main]: Transaction state change InTransaction -> AbortableError
%7|Thu Nov  3 09:02:30 2022.531|PURGE|rdkafka#producer-1| [thrd:main]: txn_output [0]: purging queues (purge_flags 0x105, exclude xmit_msgq)
%7|Thu Nov  3 09:02:30 2022.531|PURGE|rdkafka#producer-1| [thrd:main]: txn_output [1]: purging queues (purge_flags 0x105, exclude xmit_msgq)
%7|Thu Nov  3 09:02:30 2022.531|PURGE|rdkafka#producer-1| [thrd:main]: txn_output [2]: purging queues (purge_flags 0x105, exclude xmit_msgq)
%7|Thu Nov  3 09:02:30 2022.531|PURGE|rdkafka#producer-1| [thrd:main]: txn_output [3]: purging queues (purge_flags 0x105, exclude xmit_msgq)
%7|Thu Nov  3 09:02:30 2022.531|PURGE|rdkafka#producer-1| [thrd:main]: txn_output [4]: purging queues (purge_flags 0x105, exclude xmit_msgq)
%7|Thu Nov  3 09:02:30 2022.531|PURGE|rdkafka#producer-1| [thrd:main]: txn_output [5]: purging queues (purge_flags 0x105, exclude xmit_msgq)
%7|Thu Nov  3 09:02:30 2022.531|PURGE|rdkafka#producer-1| [thrd:main]: txn_output [-1]: purging queues (purge_flags 0x105, exclude xmit_msgq)
%7|Thu Nov  3 09:02:30 2022.531|PURGEQ|rdkafka#producer-1| [thrd:main]: Purged 0 message(s) from 0 UA-partition(s)
%7|Thu Nov  3 09:02:30 2022.531|PURGE|rdkafka#producer-1| [thrd:TxnCoordinator]: TxnCoordinator/1: Purging queues with flags queue,non-blocking
%7|Thu Nov  3 09:02:30 2022.531|PURGEQ|rdkafka#producer-1| [thrd:TxnCoordinator]: TxnCoordinator/1: Purged 0 in-flight, 0 retry-queued, 0 out-queue, 0 partially-sent requests
%7|Thu Nov  3 09:02:30 2022.531|PURGEQ|rdkafka#producer-1| [thrd:TxnCoordinator]: TxnCoordinator/1: Purged 0 message(s) from 0 partition(s)
%7|Thu Nov  3 09:02:30 2022.531|PURGE|rdkafka#producer-1| [thrd:10.142.65.19:9092/bootstrap]: 10.142.65.19:9092/1: Purging queues with flags queue,non-blocking
WARNING: Failed to send offsets to transaction: REBALANCE_IN_PROGRESS: Failed to commit offsets to transaction on broker 10.142.65.207:9092/3: txn_input[1](Broker: Group rebalance in progress) (after 0ms): aborting transaction
%7|Thu Nov  3 09:02:30 2022.531|PURGEQ|rdkafka#producer-1| [thrd:10.142.65.19:9092/bootstrap]: 10.142.65.19:9092/1: Purged 0 in-flight, 0 retry-queued, 0 out-queue, 0 partially-sent requests
%7|Thu Nov  3 09:02:30 2022.531|TXNAPI|rdkafka#producer-1| [thrd:app]: Transactional API called: abort_transaction (begin) (in txn state AbortableError, idemp state Assigned)
[...]
%7|Thu Nov  3 09:02:30 2022.533|TXNCOMMIT|rdkafka#producer-1| [thrd:app]: Transaction abort message purge and flush complete
%7|Thu Nov  3 09:02:30 2022.533|TXNAPI|rdkafka#producer-1| [thrd:app]: Transactional API called: abort_transaction (in txn state AbortingTransaction, idemp state Assigned)
%7|Thu Nov  3 09:02:30 2022.533|SEND|rdkafka#producer-1| [thrd:TxnCoordinator]: TxnCoordinator/1: Sent EndTxnRequest (v1, 47 bytes @ 0, CorrId 1362604)
%7|Thu Nov  3 09:02:40 2022.150|RECV|rdkafka#producer-1| [thrd:TxnCoordinator]: TxnCoordinator/1: Received EndTxnResponse (v1, 6 bytes, CorrId 1362604, rtt 9616.64ms)
%7|Thu Nov  3 09:02:40 2022.150|ENDTXN|rdkafka#producer-1| [thrd:main]: EndTxn returned UNKNOWN in state AbortingTransaction (may_retry=true)
%7|Thu Nov  3 09:02:40 2022.150|TXNERR|rdkafka#producer-1| [thrd:main]: Ignoring sub-sequent abortable transaction error: Failed to end transaction: Unknown broker error (UNKNOWN): previous error (REBALANCE_IN_PROGRESS) already raised
FATAL ERROR: Failed to abort transaction: UNKNOWN: EndTxn abort failed: Unknown broker error

rd_kafka_txn_set_abortable_error0 code

Client 2:

%7|Thu Nov  3 09:02:11 2022.404|FEATURE|rdkafka#producer-1| [thrd:TxnCoordinator]: TxnCoordinator/1: Updated enabled protocol features to ApiVersion,BrokerBalancedConsumer,Sasl,SaslHandshake,BrokerGroupCoordinator,LZ4,OffsetTime,MsgVer2,IdempotentProducer,ZSTD,SaslAuthReq
%7|Thu Nov  3 09:02:11 2022.404|STATE|rdkafka#producer-1| [thrd:TxnCoordinator]: TxnCoordinator: Broker changed state APIVERSION_QUERY -> UP
%7|Thu Nov  3 09:02:11 2022.404|BROADCAST|rdkafka#producer-1| [thrd:TxnCoordinator]: Broadcasting state change
%7|Thu Nov  3 09:02:11 2022.404|METADATA|rdkafka#producer-1| [thrd:TxnCoordinator]: TxnCoordinator/1: Request metadata for brokers only: connected
%7|Thu Nov  3 09:02:11 2022.404|SEND|rdkafka#producer-1| [thrd:TxnCoordinator]: TxnCoordinator/1: Sent MetadataRequest (v4, 26 bytes @ 0, CorrId 2)
%7|Thu Nov  3 09:02:11 2022.404|COORD|rdkafka#producer-1| [thrd:main]: TxnCoordinator/1: Transaction coordinator is now up
%7|Thu Nov  3 09:02:11 2022.404|GETPID|rdkafka#producer-1| [thrd:main]: TxnCoordinator/1: Acquiring ProducerId
%7|Thu Nov  3 09:02:11 2022.404|IDEMPSTATE|rdkafka#producer-1| [thrd:main]: Idempotent producer state change WaitTransport -> WaitPID
%7|Thu Nov  3 09:02:11 2022.404|SEND|rdkafka#producer-1| [thrd:TxnCoordinator]: TxnCoordinator/1: Sent InitProducerIdRequest (v3, 55 bytes @ 0, CorrId 3)
%7|Thu Nov  3 09:02:11 2022.405|RECV|rdkafka#producer-1| [thrd:TxnCoordinator]: TxnCoordinator/1: Received MetadataResponse (v4, 136 bytes, CorrId 2, rtt 0.19ms)
%7|Thu Nov  3 09:02:11 2022.405|METADATA|rdkafka#producer-1| [thrd:main]: TxnCoordinator/1: ===== Received metadata: connected =====
%7|Thu Nov  3 09:02:11 2022.405|METADATA|rdkafka#producer-1| [thrd:main]: TxnCoordinator/1: ClusterId: redpanda.5f7a6195-f64c-41c8-a8a0-d02205a9b109, ControllerId: 1
%7|Thu Nov  3 09:02:11 2022.405|METADATA|rdkafka#producer-1| [thrd:main]: TxnCoordinator/1: 3 brokers, 0 topics
%7|Thu Nov  3 09:02:11 2022.405|METADATA|rdkafka#producer-1| [thrd:main]: TxnCoordinator/1:   Broker #0/3: 10.142.65.19:9092 NodeId 1
%7|Thu Nov  3 09:02:11 2022.405|METADATA|rdkafka#producer-1| [thrd:main]: TxnCoordinator/1:   Broker #1/3: 10.142.65.72:9092 NodeId 2
%7|Thu Nov  3 09:02:11 2022.405|METADATA|rdkafka#producer-1| [thrd:main]: TxnCoordinator/1:   Broker #2/3: 10.142.65.207:9092 NodeId 3
%7|Thu Nov  3 09:02:11 2022.405|RECV|rdkafka#producer-1| [thrd:TxnCoordinator]: TxnCoordinator/1: Received InitProducerIdResponse (v3, 18 bytes, CorrId 3, rtt 0.64ms)
%7|Thu Nov  3 09:02:11 2022.405|GETPID|rdkafka#producer-1| [thrd:main]: Acquired PID{Id:14,Epoch:30}
%7|Thu Nov  3 09:02:11 2022.405|IDEMPSTATE|rdkafka#producer-1| [thrd:main]: Idempotent producer state change WaitPID -> Assigned
%7|Thu Nov  3 09:02:11 2022.405|TXNSTATE|rdkafka#producer-1| [thrd:main]: Transaction state change WaitPID -> ReadyNotAcked
%7|Thu Nov  3 09:02:11 2022.405|WAKEUP|rdkafka#producer-1| [thrd:main]: 10.142.65.207:9092/3: Wake-up: PID updated
%7|Thu Nov  3 09:02:11 2022.405|WAKEUP|rdkafka#producer-1| [thrd:main]: 10.142.65.72:9092/2: Wake-up: PID updated
%7|Thu Nov  3 09:02:11 2022.405|WAKEUP|rdkafka#producer-1| [thrd:main]: 10.142.65.19:9092/1: Wake-up: PID updated
%7|Thu Nov  3 09:02:11 2022.405|WAKEUP|rdkafka#producer-1| [thrd:main]: TxnCoordinator/1: Wake-up: PID updated
%7|Thu Nov  3 09:02:11 2022.405|WAKEUP|rdkafka#producer-1| [thrd:main]: Wake-up sent to 4 broker threads in state >= INIT: PID updated
%7|Thu Nov  3 09:02:11 2022.405|TXNAPI|rdkafka#producer-1| [thrd:app]: Transactional API called: init_transactions (in txn state ReadyNotAcked, idemp state Assigned)
%7|Thu Nov  3 09:02:11 2022.405|TXNSTATE|rdkafka#producer-1| [thrd:main]: Transaction state change ReadyNotAcked -> Ready
%7|Thu Nov  3 09:02:11 2022.405|TXNSTATE|rdkafka#producer-1| [thrd:main]: Transaction state change Ready -> InTransaction
%7|Thu Nov  3 09:02:11 2022.405|WAKEUP|rdkafka#producer-1| [thrd:main]: 10.142.65.207:9092/3: Wake-up: begin transaction
%7|Thu Nov  3 09:02:11 2022.405|WAKEUP|rdkafka#producer-1| [thrd:main]: 10.142.65.72:9092/2: Wake-up: begin transaction
%7|Thu Nov  3 09:02:11 2022.405|WAKEUP|rdkafka#producer-1| [thrd:main]: 10.142.65.19:9092/1: Wake-up: begin transaction
%7|Thu Nov  3 09:02:11 2022.405|WAKEUP|rdkafka#producer-1| [thrd:main]: TxnCoordinator/1: Wake-up: begin transaction
%7|Thu Nov  3 09:02:11 2022.405|WAKEUP|rdkafka#producer-1| [thrd:main]: Wake-up sent to 4 broker threads in state >= INIT: begin transaction
%7|Thu Nov  3 09:02:11 2022.904|TXNCOORD|rdkafka#producer-1| [thrd:main]: Querying for transaction coordinator: Coordinator query timer
%7|Thu Nov  3 09:02:11 2022.904|SEND|rdkafka#producer-1| [thrd:10.142.65.19:9092/bootstrap]: 10.142.65.19:9092/1: Sent FindCoordinatorRequest (v2, 41 bytes @ 0, CorrId 4)
%7|Thu Nov  3 09:02:11 2022.904|RECV|rdkafka#producer-1| [thrd:10.142.65.19:9092/bootstrap]: 10.142.65.19:9092/1: Received FindCoordinatorResponse (v2, 30 bytes, CorrId 4, rtt 0.16ms)
%7|Thu Nov  3 09:02:11 2022.904|TXNCOORD|rdkafka#producer-1| [thrd:main]: 10.142.65.19:9092/1: FindCoordinator response: Transaction coordinator is broker 1 (10.142.65.19:9092)
%7|Thu Nov  3 09:02:43 2022.410|WAKEUPFD|rdkafka#producer-1| [thrd:main]: TxnCoordinator: Enabled low-latency ops queue wake-ups
%7|Thu Nov  3 09:02:43 2022.410|BRKMAIN|rdkafka#producer-1| [thrd::0/internal]: :0/internal: Enter main broker thread
%7|Thu Nov  3 09:02:43 2022.410|BROKER|rdkafka#producer-1| [thrd:main]: TxnCoordinator: Added new broker with NodeId -1
%7|Thu Nov  3 09:02:43 2022.410|BRKMAIN|rdkafka#producer-1| [thrd:TxnCoordinator]: TxnCoordinator: Enter main broker thread
%7|Thu Nov  3 09:02:43 2022.410|STATE|rdkafka#producer-1| [thrd:TxnCoordinator]: TxnCoordinator: Broker changed state INIT -> TRY_CONNECT
%7|Thu Nov  3 09:02:43 2022.410|BROADCAST|rdkafka#producer-1| [thrd:TxnCoordinator]: Broadcasting state change
%7|Thu Nov  3 09:02:43 2022.410|WAKEUPFD|rdkafka#producer-1| [thrd:app]: 10.142.65.19:9092/bootstrap: Enabled low-latency ops queue wake-ups
%7|Thu Nov  3 09:02:43 2022.410|CONNECT|rdkafka#producer-1| [thrd:TxnCoordinator]: TxnCoordinator: broker in state TRY_CONNECT connecting
%7|Thu Nov  3 09:02:43 2022.410|CONNECT|rdkafka#producer-1| [thrd:TxnCoordinator]: TxnCoordinator: broker has no address yet: postponing connect
%7|Thu Nov  3 09:02:43 2022.410|CONNECT|rdkafka#producer-1| [thrd:TxnCoordinator]: TxnCoordinator: broker in state TRY_CONNECT connecting
%7|Thu Nov  3 09:02:43 2022.410|CONNECT|rdkafka#producer-1| [thrd:TxnCoordinator]: TxnCoordinator: broker has no address yet: postponing connect
%7|Thu Nov  3 09:02:43 2022.410|BROKER|rdkafka#producer-1| [thrd:app]: 10.142.65.19:9092/bootstrap: Added new broker with NodeId -1
%7|Thu Nov  3 09:02:43 2022.410|BRKMAIN|rdkafka#producer-1| [thrd:10.142.65.19:9092/bootstrap]: 10.142.65.19:9092/bootstrap: Enter main broker thread
%7|Thu Nov  3 09:02:43 2022.410|CONNECT|rdkafka#producer-1| [thrd:app]: 10.142.65.19:9092/bootstrap: Selected for cluster connection: bootstrap servers added (broker has 0 connection attempt(s))

Client 3:

%7|Thu Nov  3 09:02:26 2022.283|FEATURE|rdkafka#producer-1| [thrd:TxnCoordinator]: TxnCoordinator/1: Updated enabled protocol features to ApiVersion,BrokerBalancedConsumer,Sasl,SaslHandshake,BrokerGroupCoordinator,LZ4,OffsetTime,MsgVer2,IdempotentProducer,ZSTD,SaslAuthReq
%7|Thu Nov  3 09:02:26 2022.283|STATE|rdkafka#producer-1| [thrd:TxnCoordinator]: TxnCoordinator: Broker changed state APIVERSION_QUERY -> UP
%7|Thu Nov  3 09:02:26 2022.283|BROADCAST|rdkafka#producer-1| [thrd:TxnCoordinator]: Broadcasting state change
[...]
%7|Thu Nov  3 09:02:26 2022.782|SEND|rdkafka#producer-1| [thrd:10.142.65.19:9092/bootstrap]: 10.142.65.19:9092/1: Sent FindCoordinatorRequest (v2, 42 bytes @ 0, CorrId 4)
%7|Thu Nov  3 09:02:26 2022.783|RECV|rdkafka#producer-1| [thrd:10.142.65.19:9092/bootstrap]: 10.142.65.19:9092/1: Received FindCoordinatorResponse (v2, 30 bytes, CorrId 4, rtt 0.12ms)
%7|Thu Nov  3 09:02:26 2022.783|TXNCOORD|rdkafka#producer-1| [thrd:main]: 10.142.65.19:9092/1: FindCoordinator response: Transaction coordinator is broker 1 (10.142.65.19:9092)

How to reproduce the issue?

Run several instances of this kill-transactions along with a transactions which should be stable.

if [ "$#" -ne 1 ]; then
        echo "usage: $0 transaction_id"
        exit -1
fi
if [ ! -x ./examples/transactions ]; then
        echo "No transaction executable"
        exit -1
fi
export TZ=UTC
mkdir -p kill_txn_logs

while true; do
        #transaction_id="${1}"
        transaction_id="${1}_${RANDOM}_${RANDOM}"
        log_file="transactions_${transaction_id}_$(date +'%Y%m%d_%H%M')"
        echo "Starting transactions - $(date)"
        ./examples/transactions $REDPANDA_BROKERS txn_input txn_output ${transaction_id} > ${log_file} 2>&1 &
        #./examples/transactions $REDPANDA_BROKERS txn_input txn_output ${transaction_id} 2>&1 | tee -a kill_txn_logs/${log_file} &
        transaction_pid=$!
        sleep $(( $RANDOM % 10 + 30 ))
        echo "Killed"
        kill -9 $transaction_pid
        sleep 1
done

freef4ll avatar Nov 03 '22 10:11 freef4ll