redpanda
redpanda copied to clipboard
Upgrade tests fail to acquire install lock after unfinished run of PartitionMovementUpgradeTest
Version & Environment
Redpanda version: (use rpk version): dev
What went wrong?
CI failures https://buildkite.com/organizations/redpanda/pipelines/redpanda/builds/13698/jobs/01826f32-a447-4b67-9502-25fcfc3f4fe0
Several upgrade tests run fine until PartitionMovementUpgradeTest.test_basic_upgrade starts up, after which all upgrade tests fail with exceptions similar to:
[INFO:2022-08-05 19:50:02,144]: RunnerClient: rptest.tests.partition_movement_test.PartitionMovementTest.test_move_consumer_offsets_intranode.num_to_upgrade=2: FAIL: TimeoutError('')^M
Traceback (most recent call last):^M
File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run^M
data = self.run_test()^M
File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test^M
return self.test_context.function(self.test)^M
File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 476, in wrapper^M
return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)^M
File "/root/tests/rptest/services/cluster.py", line 35, in wrapped^M
r = f(self, *args, **kwargs)^M
File "/root/tests/rptest/tests/partition_movement_test.py", line 322, in test_move_consumer_offsets_intranode^M
self.start_redpanda(num_nodes=3,^M
File "/root/tests/rptest/tests/end_to_end.py", line 108, in start_redpanda^M
self.redpanda._installer.install(self.redpanda.nodes,^M
File "/root/tests/rptest/services/redpanda_installer.py", line 307, in install^M
self.start()^M
File "/root/tests/rptest/services/redpanda_installer.py", line 240, in start^M
self._acquire_install_lock()^M
File "/root/tests/rptest/services/redpanda_installer.py", line 171, in _acquire_install_lock^M
wait_until(lambda: _lock(), timeout_sec=timeout_sec)^M
File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 58, in wait_until^M
raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception^M
ducktape.errors.TimeoutError^
This particular test downloads the binaries in the setup phase, which appears to proceed successfully as indicated by the couple minutes spent before it starts running:
[INFO:2022-08-05 19:11:43,860]: RunnerClient: Loading test {'directory': '/root/tests/rptest/tests', 'file_name': 'partition_movement_upgrade_test.py', 'cls_name': 'PartitionMovementUpgradeTest', 'method_name': 'test_basic_upgrade', 'injected_args': None}^M
[INFO:2022-08-05 19:11:43,872]: RunnerClient:rptest.tests.partition_movement_upgrade_test.PartitionMovementUpgradeTest.test_basic_upgrade: Setting up...^M
...
[INFO:2022-08-05 19:12:18,959]: RunnerClient: rptest.tests.partition_movement_upgrade_test.PartitionMovementUpgradeTest.test_basic_upgrade: Running...
This is the only test in the logs we have that was not torn down:
~/Downloads$ grep "Setting up\|Tearing down" redpanda_build_13698_release-clang-amd64.log | cut -d " " -f 4 | sort | uniq -u
rptest.tests.partition_movement_upgrade_test.PartitionMovementUpgradeTest.test_basic_upgrade:
A few things come to mind as possible culprits:
- perhaps the test raised an exception or otherwise crashed maybe in another thread while the install lock was held, and somehow resulted in the
finallylock release not being executed - perhaps the body of the ssh commands (checking for existing installations, swapping symlinks) being run when running the the upgrades hung or otherwise failed
What should have happened instead?
PartitionMovementUpgradeTest.test_basic_upgradeshould complete- the tests should run independently of one another
- we should consider being a bit more verbose or using a different logger for test setup utilities so they end up in the runner logs instead of just being collected at the end
How to reproduce the issue?
- unclear
Additional information
Please attach any relevant logs, backtraces, or metric charts.
I have a branch that runs through a couple upgrades while running several franz go verifiable services. The test hangs pretty regularly; I'm not quite sure why, but it seems like the additional load makes hangs in our underlying SFTP client more likely, from GDB:
Traceback (most recent call first):
File "/usr/lib/python3.10/threading.py", line 320, in wait
waiter.acquire()
File "/home/ubuntu/.local/lib/python3.10/site-packages/paramiko/buffered_pipe.py", line 160, in read
self._cv.wait(timeout)
File "/home/ubuntu/.local/lib/python3.10/site-packages/paramiko/channel.py", line 699, in recv
out = self.in_buffer.read(nbytes, self.timeout)
File "/home/ubuntu/.local/lib/python3.10/site-packages/paramiko/sftp.py", line 185, in _read_all
x = self.sock.recv(n)
File "/home/ubuntu/.local/lib/python3.10/site-packages/paramiko/sftp.py", line 201, in _read_packet
x = self._read_all(4)
File "/home/ubuntu/.local/lib/python3.10/site-packages/paramiko/sftp_client.py", line 852, in _read_response
t, data = self._read_packet()
File "/home/ubuntu/.local/lib/python3.10/site-packages/paramiko/sftp_client.py", line 822, in _request
return self._read_response(num)
File "/home/ubuntu/.local/lib/python3.10/site-packages/paramiko/sftp_client.py", line 511, in lstat
t, msg = self._request(CMD_LSTAT, path)
File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/cluster/remoteaccount.py", line 560, in exists
self.sftp_client.lstat(path)
File "/home/ubuntu/redpanda/tests/rptest/services/redpanda_installer.py", line 329, in _install_unlocked
if not version == RedpandaInstaller.HEAD and not node.account.exists(
File "/home/ubuntu/redpanda/tests/rptest/services/redpanda_installer.py", line 311, in install
self._install_unlocked(nodes, version)
Interestingly, when I replace instances of node.account.exists() with the equivalent command but with SSH the test doesn't hang at all:
def file_exists(node, file):
try:
node.account.ssh(f"stat {file}")
except RemoteCommandError as e:
return False
return True
The underlying root cause here is likely the call to brokers() in the franz go verifiable services, which under the hood used to call out to each node to read their configs with SFTP. https://github.com/redpanda-data/redpanda/pull/6048 changed this behavior so we shouldn't see this anymore.
We should be wary of similar issues though -- I think overall we do a pretty good job at making sure background threaded services get their own SSH client, but the fact that these issues can reveal themselves as a test hanging rather than something more obvious is troublesome. Perhaps we should have a hard requirement that nodes host no more than one background service at a time? Not sure if that'd be too heavy-handed.