redpanda
More tests that exercise RPCs running across mixed versions
This PR adds tests that exercise RPC codepaths on clusters that span multiple versions. It leverages the recently introduced MixedVersionWorkloadRunner to run workloads that deterministically send RPCs in a specific direction, swapping the direction and the binary versions of sender and receiver. The primary goal is to ensure that servers with mismatched RPC serde support can still communicate correctly. This PR is a best-effort attempt at covering a fair number of RPC types.
This includes all Raft traffic, transactions traffic, and some traffic targeted at the controller leader.
Notably missing is most traffic that is induced by the Kafka API:
- alter configs
- create/delete ACLs
These tend to get sent to the controller leader and need a bit more thought to trigger inter-server RPCs deterministically (perhaps by messing around with leadership, and using a client that doesn't aggressively refresh leadership metadata).
...as well as some RPCs that are not straightforward to send in a deterministic direction:
- join node
- feature actions/barriers
- partition movement (didn't exist pre-serde)
These may be done in follow-up patches. Also note that this is just one of the efforts in ensuring reasonable serialization -- another line of testing will serialize all message types from each version and check for compatibility.
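For readers unfamiliar with the runner, here is a rough sketch of the direction-swapping idea described above. It is not the code in upgrade_with_workload.py: `run_mixed_version_workload`, `set_version`, and the stage names are hypothetical, and the only details taken from this thread are that the workload is called as `workload_fn(node0, node1)` and that the cluster goes through a partial upgrade, a rollback, and a full upgrade.

```python
from typing import Callable, List, Tuple


def run_mixed_version_workload(
    set_version: Callable[[str, str], None],
    workload_fn: Callable[[str, str], None],
    nodes: Tuple[str, str],
    old: str,
    new: str,
) -> List[Tuple[str, str, str]]:
    """Run workload_fn(src, dst) in both directions at every upgrade stage.

    set_version(node, version) is a hypothetical hook that installs a
    binary version on a node and restarts it.
    """
    node0, node1 = nodes
    log: List[Tuple[str, str, str]] = []

    def both_directions(stage: str) -> None:
        # Send RPCs node0 -> node1, then swap sender and receiver.
        for src, dst in ((node0, node1), (node1, node0)):
            workload_fn(src, dst)
            log.append((stage, src, dst))

    # Start with every node on the old version.
    set_version(node0, old)
    set_version(node1, old)

    # Partial upgrade: only node0 runs the new binary.
    set_version(node0, new)
    both_directions("partial-upgrade")

    # Rollback: node0 returns to the old binary.
    set_version(node0, old)
    both_directions("rollback")

    # Full upgrade: both nodes run the new binary.
    set_version(node0, new)
    set_version(node1, new)
    both_directions("full-upgrade")
    return log
```

Because the workload is told which node sends and which receives, each stage covers both old-to-new and new-to-old traffic while the cluster is in a mixed state.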
@andrwng Is this ready for review or are you still making edits?
Failed test was https://github.com/redpanda-data/redpanda/issues/5358
Looks like there's still an issue with KafkaRPC
Fixed the Kafka RPC issue -- it seems that after restarting nodes, the rest of the cluster may still think some node A is the leader even though A was just restarted and needs to re-establish leadership before it can properly service RPCs.
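One way a test could guard against this kind of race is to wait until the reported leader has been stable for a few consecutive polls before sending traffic. The sketch below is only an illustration and not necessarily what the fix in this PR does: `wait_for_stable_leader` and the `get_leader` callback are hypothetical names, and how the leader is actually queried is left to the caller.

```python
import time
from typing import Callable, Optional


def wait_for_stable_leader(
    get_leader: Callable[[], Optional[int]],
    timeout_sec: float = 30.0,
    backoff_sec: float = 0.5,
    stable_checks: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll get_leader() until it returns the same non-None node id
    stable_checks times in a row, or raise TimeoutError."""
    deadline = time.monotonic() + timeout_sec
    last: Optional[int] = None
    streak = 0
    while time.monotonic() < deadline:
        leader = get_leader()
        if leader is not None and leader == last:
            # Same leader as the previous poll: extend the streak.
            streak += 1
        else:
            # Leader changed or is unknown: restart the streak.
            streak = 1 if leader is not None else 0
        last = leader
        if streak >= stable_checks:
            return leader
        sleep(backoff_sec)
    raise TimeoutError(f"no stable leader within {timeout_sec}s")
```

Requiring several matching polls, rather than a single non-empty answer, avoids trusting a stale leader report from just before the restart.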
Also, what other RPCs do we need to exercise? For example, partition balancer and controller for nodes joining the cluster.
Updated the CL with some more context.
Also, per our recent standup discussion, can you do a ci-repeat-5 on this, since this is a test for important subsystems such as Raft?
/ci-repeat 10
CI failures are existing pr-blockers:
It looks like the transaction test is failing now:
====================================================================================================
test_id: rptest.tests.transactions_test.MixedVersionTransactionsTest.test_txn_rpcs_with_upgrade
status: FAIL
run time: 2 minutes 10.358 seconds
KafkaException(KafkaError{code=INVALID_TXN_STATE,val=48,str="Failed to initialize Producer ID: Broker: Producer attempted a transactional operation in an invalid state"})
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
data = self.run_test()
File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
return self.test_context.function(self.test)
File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
r = f(self, *args, **kwargs)
File "/root/tests/rptest/tests/transactions_test.py", line 573, in test_txn_rpcs_with_upgrade
MixedVersionWorkloadRunner.upgrade_with_workload(
File "/root/tests/rptest/tests/upgrade_with_workload.py", line 51, in upgrade_with_workload
workload_fn(node0, node1)
File "/root/tests/rptest/tests/transactions_test.py", line 551, in txn_workload
run_txn(should_commit=True)
File "/root/tests/rptest/tests/transactions_test.py", line 520, in run_txn
producer.init_transactions()
cimpl.KafkaException: KafkaError{code=INVALID_TXN_STATE,val=48,str="Failed to initialize Producer ID: Broker: Producer attempted a transactional operation in an invalid state"}
This test passed last I ran it, so either I borked rebasing or something's been broken. @VadimPlh does anything look fishy in the transaction test to you? It goes through a partial upgrade, rollback, and full upgrade: https://github.com/redpanda-data/redpanda/blob/dev/tests/rptest/tests/upgrade_with_workload.py for reference.
You used an API that has not been released yet: KIP-447 and KIP-360 are not supported in previous releases.
producer.send_offsets_to_transaction(
consumer.position(consumer.assignment()),
consumer.consumer_group_metadata())
Also, if a transaction has timed out, you should recreate the producer.
Also, the Python client for Kafka does not support KIP-360 well, so I preferred not to add a KIP-360 test in ducktape and used the Java client and a chaos test for it instead.
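A minimal sketch of the "recreate the producer" advice, assuming a producer object that exposes `init_transactions()`, `begin_transaction()`, and `commit_transaction()` as the confluent-kafka Python producer does. The helper `run_txn_with_fresh_producer`, the `make_producer` factory, and the `needs_new_producer` predicate are hypothetical; deciding which errors mean the transaction timed out is left to the caller.

```python
from typing import Any, Callable, Optional


def run_txn_with_fresh_producer(
    make_producer: Callable[[], Any],
    body: Callable[[Any], None],
    needs_new_producer: Callable[[Exception], bool],
    max_attempts: int = 3,
) -> Any:
    """Run one transaction, discarding the producer and building a new one
    whenever needs_new_producer(exc) says the old one is unusable."""
    last_exc: Optional[Exception] = None
    for _ in range(max_attempts):
        # Never reuse a producer whose transaction failed: start fresh.
        producer = make_producer()
        producer.init_transactions()
        try:
            producer.begin_transaction()
            body(producer)  # produce records, send offsets, etc.
            producer.commit_transaction()
            return producer
        except Exception as exc:
            if not needs_new_producer(exc):
                raise  # unrelated failure: surface it unchanged
            last_exc = exc
    raise RuntimeError(f"transaction failed after {max_attempts} attempts") from last_exc
```

Taking the factory and the predicate as parameters keeps the retry policy separate from any particular client library, so the same loop can be exercised with a fake producer in a unit test.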