
timeout in PartitionMovementUpgradeTest.test_basic_upgrade

andrwng opened this issue on Aug 4, 2022

CI failure https://buildkite.com/redpanda/redpanda/builds/13617#01826a0e-251d-46ae-a8e5-a9ff57e64800

====================================================================================================
test_id:    rptest.tests.partition_movement_upgrade_test.PartitionMovementUpgradeTest.test_basic_upgrade
status:     FAIL
run time:   13 minutes 37.176 seconds


    TimeoutError("Timed out waiting 600 seconds for service nodes to finish. These nodes are still alive: ['FranzGoVerifiableConsumerGroupConsumer-0-140299361888304 node 1 on docker-rp-22']")
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/partition_movement_upgrade_test.py", line 140, in test_basic_upgrade
    self.verify()
  File "/root/tests/rptest/tests/partition_movement_upgrade_test.py", line 71, in verify
    self.consumer.wait()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/services/background_thread.py", line 72, in wait
    super(BackgroundThreadService, self).wait(timeout_sec)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/services/service.py", line 267, in wait
    raise TimeoutError("Timed out waiting %s seconds for service nodes to finish. " % str(timeout_sec)
ducktape.errors.TimeoutError: Timed out waiting 600 seconds for service nodes to finish. These nodes are still alive: ['FranzGoVerifiableConsumerGroupConsumer-0-140299361888304 node 1 on docker-rp-22']
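
For context, the test blocks on the consumer service and ducktape enforces the 600-second wait shown above. A simplified sketch of that wait-with-timeout pattern (the names and error message come from the traceback; the polling loop and the `finished()` probe are illustrative stand-ins, not ducktape's actual implementation):

import time

def wait_for_nodes(nodes, timeout_sec=600, poll_sec=5):
    # Simplified stand-in for ducktape's Service.wait() from the traceback:
    # poll until every service node reports finished, raise at the deadline.
    # `n.finished()` is a hypothetical probe, not ducktape's real API.
    deadline = time.time() + timeout_sec
    alive = list(nodes)
    while time.time() < deadline:
        alive = [n for n in alive if not n.finished()]
        if not alive:
            return
        time.sleep(poll_sec)
    raise TimeoutError(
        "Timed out waiting %s seconds for service nodes to finish. "
        "These nodes are still alive: %s" % (timeout_sec, alive))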

Interestingly enough, this happened on the release build, so I don't think there are any shenanigans with regard to debug vs. release binaries, as seems to be the case with #5713.

andrwng · Aug 04 '22 20:08

A brief look at the test logs shows the following being logged for several minutes before the failure:

[DEBUG - 2022-08-04 19:19:32,672 - franz_go_verifiable_services - execute_cmd - lineno:74]: time="2022-08-04T19:19:32Z" level=warning msg="error fetching topic-etfevawsts/12 metadata: UNKNOWN_TOPIC_OR_PARTITION: This server does not host this topic-partition."
[DEBUG - 2022-08-04 19:19:32,672 - franz_go_verifiable_services - execute_cmd - lineno:74]: time="2022-08-04T19:19:32Z" level=warning msg="Retrying getOffsets in 2s"
[DEBUG - 2022-08-04 19:19:34,673 - franz_go_verifiable_services - execute_cmd - lineno:74]: time="2022-08-04T19:19:34Z" level=info msg="Loading offsets for topic topic-etfevawsts t=-2..."
[DEBUG - 2022-08-04 19:19:34,689 - franz_go_verifiable_services - execute_cmd - lineno:74]: time="2022-08-04T19:19:34Z" level=warning msg="error fetching topic-etfevawsts/12 metadata: UNKNOWN_TOPIC_OR_PARTITION: This server does not host this topic-partition."
[DEBUG - 2022-08-04 19:19:34,689 - franz_go_verifiable_services - execute_cmd - lineno:74]: time="2022-08-04T19:19:34Z" level=warning msg="Retrying getOffsets in 2s"

Perhaps it's an issue with how the verifier refreshes metadata? Or perhaps there's a real bug in partition movement that leaves clients pointing at the wrong node?
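
The log above shows the verifier retrying getOffsets every 2 seconds, apparently without bounding the retries; if its partition view is stale, it spins forever and the test only fails via the outer 600-second wait. A hedged sketch of a bounded retry with a metadata refresh between attempts (the verifier itself is Go/franz-go; `fetch_offsets` and `refresh_metadata` are hypothetical stand-ins, not its real API):

import time

class UnknownTopicOrPartitionError(Exception):
    """Stand-in for the UNKNOWN_TOPIC_OR_PARTITION error in the log above."""

def load_offsets(client, topic, deadline_sec=300, backoff_sec=2):
    # Retry offset loading as the verifier does, but refresh metadata
    # between attempts and give up at a deadline instead of spinning
    # forever on a stale partition view.
    deadline = time.time() + deadline_sec
    while True:
        try:
            return client.fetch_offsets(topic)    # hypothetical helper
        except UnknownTopicOrPartitionError:
            if time.time() >= deadline:
                raise  # surface the hang instead of retrying silently
            client.refresh_metadata()             # hypothetical helper
            time.sleep(backoff_sec)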

andrwng · Aug 04 '22 20:08

This looks like the same failure: https://ci-artifacts.dev.vectorized.cloud/vtools/01827b68-ebc9-4ff8-867f-d9461479bafe/vbuild/ducktape/results/2022-08-08--001/FranzGoVerifiableWithSiTest/test_si_without_timeboxed/segment_size=104857600/52/

[INFO  - 2022-08-08 03:48:11,377 - runner_client - log - lineno:278]: RunnerClient: rptest.scale_tests.franz_go_verifiable_test.FranzGoVerifiableWithSiTest.test_si_without_timeboxed.segment_size=104857600: Summary: TimeoutError("Timed out waiting 600 seconds for service nodes to finish. These nodes are still alive: ['FranzGoVerifiableSeqConsumer-0-139971196335872 node 1 on ip-172-31-60-194']")
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/mark/_mark.py", line 476, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/scale_tests/franz_go_verifiable_test.py", line 202, in test_si_without_timeboxed
    self._workload(segment_size)
  File "/home/ubuntu/redpanda/tests/rptest/scale_tests/franz_go_verifiable_test.py", line 185, in _workload
    consumer.wait()
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/services/background_thread.py", line 72, in wait
    super(BackgroundThreadService, self).wait(timeout_sec)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/services/service.py", line 267, in wait
    raise TimeoutError("Timed out waiting %s seconds for service nodes to finish. " % str(timeout_sec)
ducktape.errors.TimeoutError: Timed out waiting 600 seconds for service nodes to finish. These nodes are still alive: ['FranzGoVerifiableSeqConsumer-0-139971196335872 node 1 on ip-172-31-60-194']

[INFO  - 2022-08-08 03:48:11,378 - runner_client - log - lineno:278]: RunnerClient: rptest.scale_tests.franz_go_verifiable_test.FranzGoVerifiableWithSiTest.test_si_without_timeboxed.segment_size=104857600: Data: None

BenPope · Aug 08 '22 09:08

This may or may not be related, but there's a big rework of the kgo-verifier wrapper services in https://github.com/redpanda-data/redpanda/pull/6059 that should make them more robust: stop running them captively on an SSH channel, poll them for status via HTTP instead, and detect if they stop.
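
To illustrate the pattern that PR describes: start the verifier detached on the node, then poll a status endpoint from the test, so a dead process shows up as a connection error rather than a silent hang. A rough sketch under assumed endpoint and field names (`/status`, `done`); the real service API is defined in the PR:

import json
import time
import urllib.request

def poll_verifier(host, port=8080, interval_sec=5, timeout_sec=600):
    # Poll the verifier's HTTP status endpoint until it reports done.
    # A refused connection means the process died -- something the old
    # SSH-captive runner could only observe as a silent hang.
    deadline = time.time() + timeout_sec
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(
                    "http://%s:%d/status" % (host, port), timeout=10) as resp:
                status = json.load(resp)
        except OSError as e:
            raise RuntimeError("verifier process appears dead: %s" % e)
        if status.get("done"):    # assumed field name
            return status
        time.sleep(interval_sec)
    raise TimeoutError("verifier did not finish within %ss" % timeout_sec)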

jcsp · Aug 16 '22 20:08

Yeah, that's a good observation. While we're probably no longer hanging in the test thread, it seems reasonable to expect something like #5990 to come from the verifier threads, since they all interact with the same node over SSH.
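
To make that contention concrete: if every verifier thread funnels remote commands through one shared SSH connection, a single stalled command can block the rest. A toy model of the failure mode (the shared-lock structure is an assumption about the old wrapper, not a quote of its code):

import threading

ssh_lock = threading.Lock()  # models a single shared SSH connection

def run_remote(cmd, timeout_sec=60):
    # Without a bound on the wait, one hung command in one verifier
    # thread blocks every other thread targeting the same node.
    if not ssh_lock.acquire(timeout=timeout_sec):
        raise TimeoutError("SSH channel busy for %ss running %r"
                           % (timeout_sec, cmd))
    try:
        pass  # execute cmd over the shared channel
    finally:
        ssh_lock.release()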

andrwng · Aug 16 '22 20:08

John's kgo-verifier rework (https://github.com/redpanda-data/redpanda/pull/6059) has landed, so we shouldn't see these hangs anymore. This also hasn't been reported in a couple of weeks, which is a good sign.

andrwng · Aug 30 '22 05:08