More tests that exercise RPCs running across mixed versions

Open andrwng opened this issue 2 years ago • 8 comments

This PR adds tests that exercise RPC codepaths on clusters that span multiple versions. It leverages the recently introduced MixedVersionWorkloadRunner to run workloads that deterministically send RPCs in a specific direction, swapping the direction and the binary versions of sender and receiver. The primary goal is to ensure that servers with mismatched RPC serde support can still communicate effectively. This PR is a best-effort attempt at covering a fair number of RPC types.

This includes all Raft traffic, transactions traffic, and some traffic targeted at the controller leader.

Notably missing is most traffic that is induced by the Kafka API:

  • alter configs
  • create/delete ACLs

These tend to get sent to the controller leader and need a bit more thought to trigger inter-server RPCs deterministically (perhaps by messing around with leadership, and using a client that doesn't aggressively refresh leadership metadata).

...as well as some for which it is not straightforward to send RPCs in a deterministic direction:

  • join node
  • feature actions/barriers
  • partition movement (didn't exist pre-serde)

These may be done in follow-up patches. Also note that this is just one of the efforts in ensuring reasonable serialization -- another line of testing will serialize all message types from each version and check for compatibility.
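To illustrate what a cross-version serialization check might look like, here is a minimal sketch. It uses a toy wire format (a header of version, compat version, and payload size, followed by fixed-width fields); this layout, the field names `node_id` and `term`, and the functions `encode_v1`, `encode_v2`, and `decode` are all assumptions made for illustration and are not Redpanda's actual serde code.

```python
import struct

# Toy header: message version, oldest reader version that can decode it,
# and payload size in bytes. This layout is an assumption for illustration.
HEADER = struct.Struct("<BBI")


def encode_v1(node_id: int) -> bytes:
    """Encode the hypothetical v1 message, which has a single int32 field."""
    payload = struct.pack("<i", node_id)
    return HEADER.pack(1, 1, len(payload)) + payload


def encode_v2(node_id: int, term: int) -> bytes:
    """Encode the hypothetical v2 message, which appends an int64 field."""
    payload = struct.pack("<iq", node_id, term)
    # compat_version stays 1 because v1 readers can skip the new field.
    return HEADER.pack(2, 1, len(payload)) + payload


def decode(buf: bytes, reader_version: int) -> dict:
    """Decode a message as a reader that understands up to reader_version."""
    version, compat_version, size = HEADER.unpack_from(buf, 0)
    if compat_version > reader_version:
        raise ValueError("message requires a newer reader")
    payload = buf[HEADER.size:HEADER.size + size]
    out = {"node_id": struct.unpack_from("<i", payload, 0)[0]}
    # Only read the trailing field if both sides know about it.
    if reader_version >= 2 and version >= 2:
        out["term"] = struct.unpack_from("<q", payload, 4)[0]
    return out


# Old reader, new writer: the unknown trailing field is ignored.
old_reads_new = decode(encode_v2(3, 7), reader_version=1)
# New reader, old writer: the missing field is simply absent.
new_reads_old = decode(encode_v1(3), reader_version=2)
```

A real compatibility test would iterate over every message type and every pair of versions rather than a single hand-written pair.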

andrwng avatar Jul 22 '22 02:07 andrwng

@andrwng Is this ready for review or are you still making edits?

NyaliaLui avatar Aug 11 '22 18:08 NyaliaLui

Failed test was https://github.com/redpanda-data/redpanda/issues/5358

andrwng avatar Aug 18 '22 01:08 andrwng

Looks like there's still an issue with KafkaRPC

Fixed the Kafka RPC issue -- it seems that after restarting nodes, the rest of the cluster may still think node A is leader, even though A was just restarted and needs to re-establish leadership before it can properly service RPCs.
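For context, a fix of this kind usually amounts to polling until a leader is known before sending RPCs. The helper below is a rough sketch of that idea, not the code in this PR; `wait_for_leader` and the `get_leader` callback are hypothetical names, and a real test would query the cluster instead of a stub.

```python
import time
from typing import Callable, Optional


def wait_for_leader(
    get_leader: Callable[[], Optional[str]],
    timeout_sec: float = 30.0,
    backoff_sec: float = 0.5,
) -> str:
    """Poll get_leader until it reports a leader or the timeout expires."""
    deadline = time.monotonic() + timeout_sec
    while True:
        leader = get_leader()
        if leader is not None:
            return leader
        if time.monotonic() >= deadline:
            raise TimeoutError("no leader elected within the timeout")
        time.sleep(backoff_sec)


# Stubbed usage: the first two polls see no leader, the third one does.
_polls = iter([None, None, "node-a"])
elected = wait_for_leader(lambda: next(_polls), timeout_sec=5.0, backoff_sec=0.0)
```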

Also, what other RPCs do we need to exercise? For example, partition balancer and controller for nodes joining the cluster.

Updated the CL with some more context.

andrwng avatar Aug 18 '22 02:08 andrwng

Also, per our recent standup discussion, can you do a ci-repeat-5 on this, since this is a test for important subsystems such as Raft?

NyaliaLui avatar Aug 19 '22 17:08 NyaliaLui

/ci-repeat 10

andrwng avatar Aug 23 '22 21:08 andrwng

It looks like the transaction test is failing now:

====================================================================================================
test_id:    rptest.tests.transactions_test.MixedVersionTransactionsTest.test_txn_rpcs_with_upgrade
status:     FAIL
run time:   2 minutes 10.358 seconds


    KafkaException(KafkaError{code=INVALID_TXN_STATE,val=48,str="Failed to initialize Producer ID: Broker: Producer attempted a transactional operation in an invalid state"})
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/transactions_test.py", line 573, in test_txn_rpcs_with_upgrade
    MixedVersionWorkloadRunner.upgrade_with_workload(
  File "/root/tests/rptest/tests/upgrade_with_workload.py", line 51, in upgrade_with_workload
    workload_fn(node0, node1)
  File "/root/tests/rptest/tests/transactions_test.py", line 551, in txn_workload
    run_txn(should_commit=True)
  File "/root/tests/rptest/tests/transactions_test.py", line 520, in run_txn
    producer.init_transactions()
cimpl.KafkaException: KafkaError{code=INVALID_TXN_STATE,val=48,str="Failed to initialize Producer ID: Broker: Producer attempted a transactional operation in an invalid state"}

This test passed last I ran it, so either I borked rebasing or something's been broken. @VadimPlh does anything look fishy in the transaction test to you? It goes through a partial upgrade, rollback, and full upgrade: https://github.com/redpanda-data/redpanda/blob/dev/tests/rptest/tests/upgrade_with_workload.py for reference.
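For readers without the linked file open, the flow described above can be sketched roughly as follows. This is a simplified illustration and not the actual `upgrade_with_workload.py` implementation; `run_mixed_version_phases`, the `install` callback, and the version labels are hypothetical, and only the `workload_fn(node0, node1)` call shape is taken from the traceback.

```python
from typing import Callable, List, Tuple


def run_mixed_version_phases(
    nodes: Tuple[str, str],
    old_version: str,
    new_version: str,
    install: Callable[[str, str], None],
    workload_fn: Callable[[str, str], None],
) -> List[Tuple[str, str, str]]:
    """Run workload_fn in both directions during each upgrade phase.

    Returns a log of (phase, sender, receiver) tuples in execution order.
    """
    node0, node1 = nodes
    phases = [
        ("partial_upgrade", {node0: new_version, node1: old_version}),
        ("rollback", {node0: old_version, node1: old_version}),
        ("full_upgrade", {node0: new_version, node1: new_version}),
    ]
    log = []
    for phase_name, versions in phases:
        # Put every node on the version this phase calls for.
        for node, version in versions.items():
            install(node, version)
        # Send RPCs node0 -> node1, then swap the direction.
        for src, dst in ((node0, node1), (node1, node0)):
            workload_fn(src, dst)
            log.append((phase_name, src, dst))
    return log


# Stubbed usage that only records what would have happened.
installed: List[Tuple[str, str]] = []
sent: List[Tuple[str, str]] = []
phase_log = run_mixed_version_phases(
    ("n0", "n1"),
    "old",
    "new",
    install=lambda node, version: installed.append((node, version)),
    workload_fn=lambda src, dst: sent.append((src, dst)),
)
```

Running the workload in both directions during the partial-upgrade phase is what exercises both the old-sender/new-receiver and the new-sender/old-receiver pairing.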

andrwng avatar Sep 19 '22 16:09 andrwng

You used an API that is not released. KIP-447 and KIP-360 are not supported in previous releases.

producer.send_offsets_to_transaction(
    consumer.position(consumer.assignment()),
    consumer.consumer_group_metadata())

Also, if the transaction timed out, you should recreate the producer.
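A rough sketch of that recreate-on-timeout pattern is shown below. It does not use the real Kafka client: `TxnTimedOut` is a stand-in for whatever error the client raises on a transaction timeout, and `run_txn_with_fresh_producer`, `make_producer`, and `txn_body` are hypothetical names. The producer is only assumed to expose `init_transactions()`, `begin_transaction()`, and `commit_transaction()`.

```python
from typing import Any, Callable


class TxnTimedOut(Exception):
    """Stand-in for the client error raised when a transaction times out."""


def run_txn_with_fresh_producer(
    make_producer: Callable[[], Any],
    txn_body: Callable[[Any], None],
    max_attempts: int = 3,
) -> Any:
    """Run one transaction, building a new producer after each timeout."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_err = None
    for _ in range(max_attempts):
        # A producer whose transaction timed out is never reused.
        producer = make_producer()
        producer.init_transactions()
        producer.begin_transaction()
        try:
            txn_body(producer)
            producer.commit_transaction()
            return producer
        except TxnTimedOut as err:
            last_err = err
    raise RuntimeError("transaction kept timing out") from last_err
```

The factory argument is the design point: the caller hands over a way to build producers rather than a single instance, so the timed-out one can simply be dropped.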

Also, the Python client for Kafka does not support KIP-360 well, so I preferred not to add a KIP-360 test in ducktape and used the Java client and chaos tests for it instead.

VadimPlh avatar Sep 19 '22 16:09 VadimPlh