timeout in PartitionMovementUpgradeTest.test_basic_upgrade
CI failure https://buildkite.com/redpanda/redpanda/builds/13617#01826a0e-251d-46ae-a8e5-a9ff57e64800
====================================================================================================
test_id: rptest.tests.partition_movement_upgrade_test.PartitionMovementUpgradeTest.test_basic_upgrade
status: FAIL
run time: 13 minutes 37.176 seconds
TimeoutError("Timed out waiting 600 seconds for service nodes to finish. These nodes are still alive: ['FranzGoVerifiableConsumerGroupConsumer-0-140299361888304 node 1 on docker-rp-22']")
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
data = self.run_test()
File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
return self.test_context.function(self.test)
File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
r = f(self, *args, **kwargs)
File "/root/tests/rptest/tests/partition_movement_upgrade_test.py", line 140, in test_basic_upgrade
self.verify()
File "/root/tests/rptest/tests/partition_movement_upgrade_test.py", line 71, in verify
self.consumer.wait()
File "/usr/local/lib/python3.10/dist-packages/ducktape/services/background_thread.py", line 72, in wait
super(BackgroundThreadService, self).wait(timeout_sec)
File "/usr/local/lib/python3.10/dist-packages/ducktape/services/service.py", line 267, in wait
raise TimeoutError("Timed out waiting %s seconds for service nodes to finish. " % str(timeout_sec)
ducktape.errors.TimeoutError: Timed out waiting 600 seconds for service nodes to finish. These nodes are still alive: ['FranzGoVerifiableConsumerGroupConsumer-0-140299361888304 node 1 on docker-rp-22']
Interestingly enough, this happened on a release build, so I don't think there are any shenanigans with regard to debug vs. release binaries, as seems to be the case with #5713.
A brief look into the test logs shows the following being logged for several minutes before failing:
[DEBUG - 2022-08-04 19:19:32,672 - franz_go_verifiable_services - execute_cmd - lineno:74]: time="2022-08-04T19:19:32Z" level=warning msg="error fetching topic-etfevawsts/12 metadata: UNKNOWN_TOPIC_OR_PARTITION: This server does not host this topic-partition."
[DEBUG - 2022-08-04 19:19:32,672 - franz_go_verifiable_services - execute_cmd - lineno:74]: time="2022-08-04T19:19:32Z" level=warning msg="Retrying getOffsets in 2s"
[DEBUG - 2022-08-04 19:19:34,673 - franz_go_verifiable_services - execute_cmd - lineno:74]: time="2022-08-04T19:19:34Z" level=info msg="Loading offsets for topic topic-etfevawsts t=-2..."
[DEBUG - 2022-08-04 19:19:34,689 - franz_go_verifiable_services - execute_cmd - lineno:74]: time="2022-08-04T19:19:34Z" level=warning msg="error fetching topic-etfevawsts/12 metadata: UNKNOWN_TOPIC_OR_PARTITION: This server does not host this topic-partition."
[DEBUG - 2022-08-04 19:19:34,689 - franz_go_verifiable_services - execute_cmd - lineno:74]: time="2022-08-04T19:19:34Z" level=warning msg="Retrying getOffsets in 2s"
Perhaps it's an issue with how the verifier refreshes metadata? Or perhaps there's a real bug in partition movement that isn't redirecting clients properly?
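One way to tell those two apart is to ask a client with completely fresh metadata about the same partition. A minimal sketch, using kafka-python purely for illustration (the verifier itself is franz-go based, so none of this is the actual verifier code):

```python
# Hypothetical diagnostic, not part of the test: a brand-new client has
# brand-new metadata, so it shows whether the cluster is actually advertising
# the partition the verifier keeps failing on.
from kafka import KafkaConsumer


def partition_visible_to_fresh_client(bootstrap_servers, topic, partition):
    """Return True if a freshly created client can see topic/partition."""
    consumer = KafkaConsumer(bootstrap_servers=bootstrap_servers)
    try:
        partitions = consumer.partitions_for_topic(topic) or set()
        return partition in partitions
    finally:
        consumer.close()


# If this returns True while the long-running verifier still logs
# UNKNOWN_TOPIC_OR_PARTITION, stale client metadata in the verifier is the
# likely culprit; if it returns False, partition movement is not exposing the
# partition on its new node.
# e.g. partition_visible_to_fresh_client(["docker-rp-22:9092"], "topic-etfevawsts", 12)
```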
This looks the same: https://ci-artifacts.dev.vectorized.cloud/vtools/01827b68-ebc9-4ff8-867f-d9461479bafe/vbuild/ducktape/results/2022-08-08--001/FranzGoVerifiableWithSiTest/test_si_without_timeboxed/segment_size=104857600/52/
[INFO - 2022-08-08 03:48:11,377 - runner_client - log - lineno:278]: RunnerClient: rptest.scale_tests.franz_go_verifiable_test.FranzGoVerifiableWithSiTest.test_si_without_timeboxed.segment_size=104857600: Summary: TimeoutError("Timed out waiting 600 seconds for service nodes to finish. These nodes are still alive: ['FranzGoVerifiableSeqConsumer-0-139971196335872 node 1 on ip-172-31-60-194']")
Traceback (most recent call last):
File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 135, in run
data = self.run_test()
File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 227, in run_test
return self.test_context.function(self.test)
File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/mark/_mark.py", line 476, in wrapper
return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 35, in wrapped
r = f(self, *args, **kwargs)
File "/home/ubuntu/redpanda/tests/rptest/scale_tests/franz_go_verifiable_test.py", line 202, in test_si_without_timeboxed
self._workload(segment_size)
File "/home/ubuntu/redpanda/tests/rptest/scale_tests/franz_go_verifiable_test.py", line 185, in _workload
consumer.wait()
File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/services/background_thread.py", line 72, in wait
super(BackgroundThreadService, self).wait(timeout_sec)
File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/services/service.py", line 267, in wait
raise TimeoutError("Timed out waiting %s seconds for service nodes to finish. " % str(timeout_sec)
ducktape.errors.TimeoutError: Timed out waiting 600 seconds for service nodes to finish. These nodes are still alive: ['FranzGoVerifiableSeqConsumer-0-139971196335872 node 1 on ip-172-31-60-194']
[INFO - 2022-08-08 03:48:11,378 - runner_client - log - lineno:278]: RunnerClient: rptest.scale_tests.franz_go_verifiable_test.FranzGoVerifiableWithSiTest.test_si_without_timeboxed.segment_size=104857600: Data: None
This may or may not be related, but there's a big rework of the kgo-verifier wrapper services in https://github.com/redpanda-data/redpanda/pull/6059 that should make them more robust (stop running them captively on an SSH channel, poll them for status via HTTP instead, and detect if they stop).
Yeah, that's a good observation. While we are probably no longer hanging in the test thread itself, it seems reasonable to expect something like #5990 coming in from the verifier threads, since they all interact with the same node over SSH.
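For reference, the gist of the status-polling approach in that rework is roughly the sketch below; the /status endpoint and the JSON fields are only illustrative assumptions, not the actual kgo-verifier API:

```python
# Rough sketch: the ducktape wrapper starts the verifier detached on the
# remote node and then polls it over HTTP instead of sitting on a captive SSH
# channel.  Endpoint path and fields are assumptions for illustration.
import time

import requests


def wait_for_verifier(host, port, timeout_sec=600, poll_interval_sec=5):
    """Poll the verifier's HTTP status endpoint until it reports completion."""
    deadline = time.time() + timeout_sec
    status = None
    while time.time() < deadline:
        try:
            status = requests.get(f"http://{host}:{port}/status", timeout=10).json()
        except requests.ConnectionError:
            # The remote process died: fail fast with a useful message instead
            # of silently waiting out the full timeout on a dead SSH channel.
            raise RuntimeError(f"verifier on {host} stopped responding")
        if status.get("done"):
            return status
        time.sleep(poll_interval_sec)
    raise TimeoutError(
        f"verifier on {host} still running after {timeout_sec}s, last status: {status}")
```

Polling like this also means a wedged or killed verifier shows up as a connection error or stalled counters rather than a silent ten-minute hang.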
John's kgo-verifier rework (https://github.com/redpanda-data/redpanda/pull/6059) has landed, so we shouldn't see these hangs anymore. This hasn't been reported in a couple of weeks, which is a good sign.