CI Failure (`TimeoutError: Node docker-rp-xx draining leaderships`) in `BasicAuthUpgradeTest.test_upgrade_and_enable_basic_auth`
https://buildkite.com/redpanda/redpanda/builds/27277#01878dc4-9515-414e-88a6-89d717325c82
Module: rptest.tests.pandaproxy_test
Class: BasicAuthUpgradeTest
Method: test_upgrade_and_enable_basic_auth
Arguments:
{
"base_release": [
22,
3
],
"next_release": [
23,
1
]
}
====================================================================================================
test_id: rptest.tests.pandaproxy_test.BasicAuthUpgradeTest.test_upgrade_and_enable_basic_auth.base_release=.22.3.next_release=.23.1
status: FAIL
run time: 54.194 seconds
TimeoutError('Node docker-rp-23 draining leaderships')
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
data = self.run_test()
File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
return self.test_context.function(self.test)
File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 481, in wrapper
return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
File "/root/tests/rptest/services/cluster.py", line 49, in wrapped
r = f(self, *args, **kwargs)
File "/root/tests/rptest/tests/pandaproxy_test.py", line 2127, in test_upgrade_and_enable_basic_auth
self.redpanda.rolling_restart_nodes(self.redpanda.nodes)
File "/root/tests/rptest/services/redpanda.py", line 2289, in rolling_restart_nodes
restarter.restart_nodes(nodes,
File "/root/tests/rptest/services/rolling_restarter.py", line 88, in restart_nodes
wait_until(lambda: has_drained_leaders(node),
File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 57, in wait_until
raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError: Node docker-rp-23 draining leaderships
Probably related https://github.com/redpanda-data/redpanda/issues/9049
on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/29125#018818fc-3f05-4301-a16c-3031c6b5d247
A very similar failure occurred in https://buildkite.com/redpanda/redpanda/builds/29432#01883193-77a7-4f46-bac1-be8fac94ff52
test_id: rptest.tests.cluster_bootstrap_test.ClusterBootstrapUpgrade.test_change_bootstrap_configs_during_upgrade.empty_seed_starts_cluster=False
status: FAIL
run time: 1 minute 31.113 seconds
TimeoutError('Node docker-rp-23 draining leaderships')
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
data = self.run_test()
File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
return self.test_context.function(self.test)
File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 481, in wrapper
return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
File "/root/tests/rptest/services/cluster.py", line 49, in wrapped
r = f(self, *args, **kwargs)
File "/root/tests/rptest/tests/cluster_bootstrap_test.py", line 137, in test_change_bootstrap_configs_during_upgrade
self.redpanda.rolling_restart_nodes(self.redpanda.nodes,
File "/root/tests/rptest/services/redpanda.py", line 953, in rolling_restart_nodes
restarter.restart_nodes(nodes,
File "/root/tests/rptest/services/rolling_restarter.py", line 88, in restart_nodes
wait_until(lambda: has_drained_leaders(node),
File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 57, in wait_until
raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError: Node docker-rp-23 draining leaderships
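For context, both failures come out of ducktape's `wait_until` polling loop. A simplified sketch of its semantics (not the actual ducktape source, but it matches the raise on util.py line 57 in the tracebacks): poll a condition until it returns truthy, swallow exceptions between attempts, and raise `TimeoutError` with the message once the deadline passes.

```python
import time


def wait_until(condition, timeout_sec=30, backoff_sec=0.5, err_msg=""):
    """Poll `condition` until truthy or until `timeout_sec` elapses.

    Simplified sketch of ducktape-style semantics; not the real implementation.
    """
    deadline = time.time() + timeout_sec
    last_exception = None
    while time.time() < deadline:
        try:
            if condition():
                return
        except Exception as e:
            # Remember the last failure so the timeout can chain it.
            last_exception = e
        time.sleep(backoff_sec)
    # Mirrors the raise site seen in the traceback above.
    raise TimeoutError(
        err_msg() if callable(err_msg) else err_msg) from last_exception
```

The key property for this bug: the loop only ever *samples* the condition, so a state that is true briefly between two polls is never observed.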
Log excerpts from docker-rp-23 (node_id 1):
DEBUG - 2023-05-19 02:04:33,890 - admin - _request - lineno:332]: Dispatching put http://docker-rp-23:9644/v1/brokers/1/maintenance
DEBUG 2023-05-19 02:04:33,893 [shard 0] admin_api_server - admin_server.cc:365 - [admin] PUT http://docker-rp-23:9644/v1/brokers/1/maintenance
INFO 2023-05-19 02:04:33,894 [shard 0] cluster - members_table.cc:221 - marking node 1 in maintenance state
INFO 2023-05-19 02:04:33,894 [shard 1] cluster - members_table.cc:221 - marking node 1 in maintenance state
...
INFO 2023-05-19 02:04:33,894 [shard 0] cluster - drain_manager.cc:49 - Node draining is starting
INFO 2023-05-19 02:04:33,894 [shard 0] cluster - drain_manager.cc:143 - Node draining has started
INFO 2023-05-19 02:04:33,894 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.cc:2946 - Starting leadership transfer from {id: {1}, revision: {0}} to {id: {3}, revision: {0}} in term 1
TRACE 2023-05-19 02:04:33,894 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.cc:2801 - transfer leadership: preparing target={id: {3}, revision: {0}}, dirty_offset=21
TRACE 2023-05-19 02:04:33,894 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.cc:2807 - transfer leadership: cleared oplock
DEBUG 2023-05-19 02:04:33,894 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.cc:2839 - transfer leadership: node {id: {3}, revision: {0}} doesn't need recovery or is already recovering (is_recovering false dirty offset 21)
DEBUG 2023-05-19 02:04:33,894 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.cc:2872 - transfer leadership: node {id: {3}, revision: {0}} is not recovering, proceeding (dirty offset 21)
INFO 2023-05-19 02:04:33,894 [shard 0] cluster - drain_manager.cc:207 - Draining leadership from 1 groups
INFO 2023-05-19 02:04:33,894 [shard 1] cluster - drain_manager.cc:49 - Node draining is starting
INFO 2023-05-19 02:04:33,894 [shard 1] cluster - drain_manager.cc:143 - Node draining has started
INFO 2023-05-19 02:04:33,894 [shard 1] cluster - drain_manager.cc:256 - Node draining has completed on shard 1
TRACE 2023-05-19 02:04:33,894 [shard 0] storage - readers_cache.cc:328 - {redpanda/controller/0} - removing reader: [0,21] lower_bound: 22
[DEBUG - 2023-05-19 02:04:33,895 - admin - _request - lineno:361]: Response OK
[INFO - 2023-05-19 02:04:33,895 - rolling_restarter - restart_nodes - lineno:86]: Waiting for node docker-rp-23 leadership drain
[DEBUG - 2023-05-19 02:04:33,895 - admin - _request - lineno:332]: Dispatching get http://docker-rp-23:9644/v1/brokers/1
INFO 2023-05-19 02:04:33,895 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.cc:187 - [leadership_transfer] Stepping down as leader in term 1, dirty offset 21
DEBUG 2023-05-19 02:04:33,895 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.cc:2618 - triggering leadership notification with term: 1, new leader: {nullopt}
DEBUG 2023-05-19 02:04:33,895 [shard 0] cluster - health_monitor_backend.cc:395 - aborting current refresh request to 1
DEBUG 2023-05-19 02:04:33,895 [shard 0] cluster - feature_manager.cc:117 - Controller leader notification term 1
TRACE 2023-05-19 02:04:33,895 [shard 0] cluster - partition_leaders_table.cc:160 - updated partition: {redpanda/controller/0} leader: {term: 1, current leader: {nullopt}, previous leader: {1}, revision: 0}
INFO 2023-05-19 02:04:33,895 [shard 0] cluster - drain_manager.cc:239 - Draining leadership from 1 groups 1 succeeded
INFO 2023-05-19 02:04:33,895 [shard 0] cluster - drain_manager.cc:256 - Node draining has completed on shard 0
TRACE 2023-05-19 02:04:33,895 [shard 1] cluster - partition_leaders_table.cc:160 - updated partition: {redpanda/controller/0} leader: {term: 1, current leader: {nullopt}, previous leader: {1}, revision: 0}
TRACE 2023-05-19 02:04:33,898 [shard 0] request_auth - request_auth.cc:126 - Authenticated user {admin}
DEBUG 2023-05-19 02:04:33,898 [shard 0] admin_api_server - admin_server.cc:365 - [admin] GET http://docker-rp-23:9644/v1/brokers/1
[DEBUG - 2023-05-19 02:04:33,899 - admin - _request - lineno:355]: Response OK, JSON: {
'node_id': 1,
'num_cores': 2,
'membership_status': 'active',
'maintenance_status': {
'draining': False,
'finished': False,
'errors': False,
'partitions': 0,
'eligible': 0,
'transferring': 0,
'failed': 0
}
}
Whereas the check in rolling_restarter.py reads (exception handler elided in the excerpt):

```python
def has_drained_leaders(node):
    try:
        node_id = self.redpanda.idx(node)
        broker_resp = admin.get_broker(node_id, node=node)
        maintenance_status = broker_resp["maintenance_status"]
        return maintenance_status["draining"] and maintenance_status["finished"]
```
So it looks like node draining finished before the test fetched the drain status for the first time. The check requires observing `draining: true`, but by the first poll the status had already been reset, so the condition can never be satisfied and the wait times out.
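One way the race could be tolerated is sketched below. This is a hypothetical illustration, not the actual rolling_restarter.py fix: instead of requiring the transient `draining: true` state to be observed, it also accepts a status in which the node no longer holds any leaderships, as in the JSON response above.

```python
def has_drained_leaders(maintenance_status):
    # Hypothetical race-tolerant check; `maintenance_status` is the dict from
    # the admin API response shown above.
    # Done if the broker explicitly reports the drain as in progress and
    # finished...
    if maintenance_status["draining"] and maintenance_status["finished"]:
        return True
    # ...or if the drain already completed and the status was reset before we
    # polled: no longer draining and zero partitions left on the node.
    return (not maintenance_status["draining"]
            and maintenance_status["partitions"] == 0)
```

With the response from the failing run (`draining: False`, `partitions: 0`), this variant would report the node as drained instead of waiting forever for a state that has already passed.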
on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/29360#01882d69-f998-49ce-a0c8-4f6bde5a6463
This sounds more like a test failure, and hence sev/low to me. @michael-redpanda could you keep me honest, please?
https://buildkite.com/redpanda/redpanda/builds/37651#018ad049-05dd-4d29-a1b5-b6fc8bba64c4
reported as stale by pandatriage
* https://buildkite.com/redpanda/redpanda/builds/40369
* https://buildkite.com/redpanda/vtools/builds/10446
* https://buildkite.com/redpanda/vtools/builds/11146
* https://buildkite.com/redpanda/vtools/builds/11190
* https://buildkite.com/redpanda/redpanda/builds/43078
* https://buildkite.com/redpanda/vtools/builds/11257
* https://buildkite.com/redpanda/redpanda/builds/44935
* https://buildkite.com/redpanda/redpanda/builds/46544
* https://buildkite.com/redpanda/vtools/builds/12656
* https://buildkite.com/redpanda/vtools/builds/13139
* https://buildkite.com/redpanda/vtools/builds/13519
* https://buildkite.com/redpanda/vtools/builds/13656
* https://buildkite.com/redpanda/redpanda/builds/49573
* https://buildkite.com/redpanda/vtools/builds/14105
* https://buildkite.com/redpanda/redpanda/builds/51106
* https://buildkite.com/redpanda/redpanda/builds/51183
* https://buildkite.com/redpanda/vtools/builds/15844
* https://buildkite.com/redpanda/vtools/builds/16241
* https://buildkite.com/redpanda/vtools/builds/16399