redpanda Upgrade tests fail to acquire install lock after unfinished run of PartitionMovementUpgradeTest

Version & Environment

Redpanda version: (use rpk version): dev

What went wrong?

CI failures https://buildkite.com/organizations/redpanda/pipelines/redpanda/builds/13698/jobs/01826f32-a447-4b67-9502-25fcfc3f4fe0

Several upgrade tests run fine until PartitionMovementUpgradeTest.test_basic_upgrade starts up, after which all upgrade tests fail with exceptions similar to:

[INFO:2022-08-05 19:50:02,144]: RunnerClient: rptest.tests.partition_movement_test.PartitionMovementTest.test_move_consumer_offsets_intranode.num_to_upgrade=2: FAIL: TimeoutError('')^M
Traceback (most recent call last):^M
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run^M
    data = self.run_test()^M
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test^M
    return self.test_context.function(self.test)^M
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 476, in wrapper^M
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)^M
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped^M
    r = f(self, *args, **kwargs)^M
  File "/root/tests/rptest/tests/partition_movement_test.py", line 322, in test_move_consumer_offsets_intranode^M
    self.start_redpanda(num_nodes=3,^M
  File "/root/tests/rptest/tests/end_to_end.py", line 108, in start_redpanda^M
    self.redpanda._installer.install(self.redpanda.nodes,^M
  File "/root/tests/rptest/services/redpanda_installer.py", line 307, in install^M
    self.start()^M
  File "/root/tests/rptest/services/redpanda_installer.py", line 240, in start^M
    self._acquire_install_lock()^M
  File "/root/tests/rptest/services/redpanda_installer.py", line 171, in _acquire_install_lock^M
    wait_until(lambda: _lock(), timeout_sec=timeout_sec)^M
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 58, in wait_until^M
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception^M
ducktape.errors.TimeoutError^

This particular test downloads the binaries in the setup phase, which appears to proceed successfully as indicated by the couple minutes spent before it starts running:

[INFO:2022-08-05 19:11:43,860]: RunnerClient: Loading test {'directory': '/root/tests/rptest/tests', 'file_name': 'partition_movement_upgrade_test.py', 'cls_name': 'PartitionMovementUpgradeTest', 'method_name': 'test_basic_upgrade', 'injected_args': None}^M
[INFO:2022-08-05 19:11:43,872]: RunnerClient:rptest.tests.partition_movement_upgrade_test.PartitionMovementUpgradeTest.test_basic_upgrade: Setting up...^M
...                                
[INFO:2022-08-05 19:12:18,959]: RunnerClient: rptest.tests.partition_movement_upgrade_test.PartitionMovementUpgradeTest.test_basic_upgrade: Running...

This is the only test in the logs we have that was not torn down:

~/Downloads$ grep "Setting up\|Tearing down" redpanda_build_13698_release-clang-amd64.log  | cut -d " " -f 4 | sort | uniq -u
rptest.tests.partition_movement_upgrade_test.PartitionMovementUpgradeTest.test_basic_upgrade:

A few things come to mind as possible culprits:

perhaps the test raised an exception or otherwise crashed maybe in another thread while the install lock was held, and somehow resulted in the finally lock release not being executed
perhaps the body of the ssh commands (checking for existing installations, swapping symlinks) being run when running the the upgrades hung or otherwise failed

What should have happened instead?

PartitionMovementUpgradeTest.test_basic_upgrade should complete
the tests should run independently of one another
we should consider being a bit more verbose or using a different logger for test setup utilities so they end up in the runner logs instead of just being collected at the end

How to reproduce the issue?

unclear

Additional information

Please attach any relevant logs, backtraces, or metric charts.

Aug 09 '22 00:08 andrwng

I have a branch that runs through a couple upgrades while running several franz go verifiable services. The test hangs pretty regularly; I'm not quite sure why, but it seems like the additional load makes hangs in our underlying SFTP client more likely, from GDB:

Traceback (most recent call first):
  File "/usr/lib/python3.10/threading.py", line 320, in wait
    waiter.acquire()
  File "/home/ubuntu/.local/lib/python3.10/site-packages/paramiko/buffered_pipe.py", line 160, in read
    self._cv.wait(timeout)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/paramiko/channel.py", line 699, in recv
    out = self.in_buffer.read(nbytes, self.timeout)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/paramiko/sftp.py", line 185, in _read_all
    x = self.sock.recv(n)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/paramiko/sftp.py", line 201, in _read_packet
    x = self._read_all(4)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/paramiko/sftp_client.py", line 852, in _read_response
    t, data = self._read_packet()
  File "/home/ubuntu/.local/lib/python3.10/site-packages/paramiko/sftp_client.py", line 822, in _request
    return self._read_response(num)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/paramiko/sftp_client.py", line 511, in lstat
    t, msg = self._request(CMD_LSTAT, path)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/cluster/remoteaccount.py", line 560, in exists
    self.sftp_client.lstat(path)
  File "/home/ubuntu/redpanda/tests/rptest/services/redpanda_installer.py", line 329, in _install_unlocked
    if not version == RedpandaInstaller.HEAD and not node.account.exists(
  File "/home/ubuntu/redpanda/tests/rptest/services/redpanda_installer.py", line 311, in install
    self._install_unlocked(nodes, version)

Interestingly, when I replace instances of node.account.exists() with the equivalent command but with SSH the test doesn't hang at all:

def file_exists(node, file):
    try:
        node.account.ssh(f"stat {file}")
    except RemoteCommandError as e:
        return False
    return True

Aug 12 '22 16:08 andrwng

The underlying root cause here is likely the call to brokers() in the franz go verifiable services, which under the hood used to call out to each node to read their configs with SFTP. https://github.com/redpanda-data/redpanda/pull/6048 changed this behavior so we shouldn't see this anymore.

We should be wary of similar issues though -- I think overall we do a pretty good job at making sure background threaded services get their own SSH client, but the fact that these issues can reveal themselves as a test hanging rather than something more obvious is troublesome. Perhaps we should have a hard requirement that nodes host no more than one background service at a time? Not sure if that'd be too heavy-handed.

Aug 18 '22 00:08 andrwng

redpanda redpanda copied to clipboard

Upgrade tests fail to acquire install lock after unfinished run of PartitionMovementUpgradeTest

Version & Environment

What went wrong?

What should have happened instead?

How to reproduce the issue?

Additional information

redpanda
redpanda copied to clipboard