Umbrella issue for "redpanda failed to start in 20 sec"
There are several CI-failure issues where Redpanda failed to start within 20 seconds, and I believe they share the same underlying cause.
This is an umbrella issue for them so we have one stream of investigation. We can't close the other issues because of how PT works, but we can point them here.
A typical example:
TimeoutError('Redpanda service docker-rp-7 failed to start within 20 sec')
Traceback (most recent call last):
File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 184, in _do_run
data = self.run_test()
File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 276, in run_test
return self.test_context.function(self.test)
File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/mark/_mark.py", line 535, in wrapper
return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
File "/root/tests/rptest/services/cluster.py", line 103, in wrapped
r = f(self, *args, **kwargs)
File "/root/tests/rptest/tests/partition_movement_test.py", line 971, in test_shadow_indexing
self.start_redpanda(num_nodes=3, install_opts=install_opts)
File "/root/tests/rptest/tests/end_to_end.py", line 125, in start_redpanda
self.redpanda.start(auto_assign_node_id=new_bootstrap,
File "/root/tests/rptest/services/redpanda.py", line 2552, in start
raise e
File "/root/tests/rptest/services/redpanda.py", line 2544, in start
self.for_nodes(to_start, start_one)
File "/root/tests/rptest/services/redpanda.py", line 1391, in for_nodes
return list(executor.map(cb, nodes))
File "/usr/lib/python3.10/concurrent/futures/_base.py", line 621, in result_iterator
yield _result_or_cancel(fs.pop())
File "/usr/lib/python3.10/concurrent/futures/_base.py", line 319, in _result_or_cancel
return fut.result(timeout)
File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result
return self.__get_result()
File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
raise self._exception
File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/root/tests/rptest/services/redpanda.py", line 2536, in start_one
self.start_node(node,
File "/root/tests/rptest/services/redpanda.py", line 2871, in start_node
self.start_service(node, start_rp)
ducktape.errors.TimeoutError: Redpanda service docker-rp-7 failed to start within 20 sec
Test requested 5 nodes, used only 3
JIRA Link: CORE-4276
Maybe we add a call to https://github.com/redpanda-data/redpanda/blob/5f075743fdc2659d599d44425f738432eb8d8fd2/tests/rptest/services/redpanda.py#L1548-L1570 on nodes that fail to start within 20 seconds? We could at least start with that, and then hopefully further distinguish the reason for the 20-second timeout.
That probably won't catch some of the above failures, but it would have caught the failure in #18845, since that node asserted at startup.
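A minimal sketch of what that wiring might look like, assuming the helper at the linked lines can be invoked per node; `collect_start_failure_diagnostics` is a placeholder name for whatever that method is actually called, not the real API:

# Hedged sketch only -- not the current redpanda.py code. It assumes the helper
# at the linked lines is callable per node; `collect_start_failure_diagnostics`
# is a placeholder for its actual name.
from ducktape.errors import TimeoutError as DucktapeTimeoutError

def start_one(self, node, **kwargs):
    try:
        self.start_node(node, **kwargs)
    except DucktapeTimeoutError:
        try:
            # Best effort: grab logs/backtraces before re-raising so a
            # 20-second timeout leaves something to distinguish it by.
            self.collect_start_failure_diagnostics(node)  # hypothetical name
        except Exception as e:
            self.logger.warning(
                f"diagnostics collection failed on {node.account.hostname}: {e}")
        raise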
Just noting that many/most of the linked issues have had their artifacts aged off.
I briefly spent some time looking at https://github.com/redpanda-data/redpanda/issues/18793 which still has logs.
In this instance, it looks like rp-14 successfully registers as node id 5, but its controller doesn't get replicated to for 20 seconds, so the test times out. rp-3 is the leader at this time (rp-n here is shorthand for ducktape-node-n-nationally-ultimate-lamb). On that node, it appears to be trying to send RPCs:
TRACE 2024-06-04 10:57:24,379 [shard 0:main] rpc - transport.cc:358 - Dispatched request with sequence: 3, correlation_idx: 3, pending queue_size: 0, target_address: {host: ducktape-node-14-nationally-ultimate-lamb, port: 33145}
TRACE 2024-06-04 10:57:24,479 [shard 0:main] rpc - transport.cc:358 - Dispatched request with sequence: 4, correlation_idx: 4, pending queue_size: 0, target_address: {host: ducktape-node-14-nationally-ultimate-lamb, port: 33145}
TRACE 2024-06-04 10:57:24,579 [shard 0:main] rpc - transport.cc:358 - Dispatched request with sequence: 5, correlation_idx: 5, pending queue_size: 0, target_address: {host: ducktape-node-14-nationally-ultimate-lamb, port: 33145}
TRACE 2024-06-04 10:57:24,679 [shard 0:main] rpc - transport.cc:358 - Dispatched request with sequence: 6, correlation_idx: 6, pending queue_size: 0, target_address: {host: ducktape-node-14-nationally-ultimate-lamb, port: 33145}
TRACE 2024-06-04 10:57:24,780 [shard 0:main] rpc - transport.cc:358 - Dispatched request with sequence: 7, correlation_idx: 7, pending queue_size: 0, target_address: {host: ducktape-node-14-nationally-ultimate-lamb, port: 33145}
TRACE 2024-06-04 10:57:24,880 [shard 0:main] rpc - transport.cc:358 - Dispatched request with sequence: 8, correlation_idx: 8, pending queue_size: 0, target_address: {host: ducktape-node-14-nationally-ultimate-lamb, port: 33145}
TRACE 2024-06-04 10:57:24,980 [shard 0:main] rpc - transport.cc:358 - Dispatched request with sequence: 9, correlation_idx: 9, pending queue_size: 0, target_address: {host: ducktape-node-14-nationally-ultimate-lamb, port: 33145}
but it doesn't seem to take.
https://ci-artifacts.dev.vectorized.cloud/vtools/14337/018fe244-7bf3-4bd5-b6ff-43ccc1a311d4/vbuild/ducktape/results/2024-06-04--001/report.html
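For triage, a throwaway script along these lines can pull the dispatch cadence out of the leader's log; the regex is based only on the log-line format quoted above, and the log path / hostname arguments are whatever you download from the CI artifacts:

# Ad-hoc triage script, not part of rptest: prints the "Dispatched request"
# TRACE lines aimed at a given host, so the retry cadence (the ~100 ms spacing
# above) is easy to eyeball.
import re
import sys

DISPATCH_RE = re.compile(
    r"^TRACE (?P<ts>\S+ \S+) .*Dispatched request with sequence: (?P<seq>\d+), "
    r"correlation_idx: (?P<corr>\d+).*target_address: \{host: (?P<host>[^,]+),"
)

def dispatches(log_path: str, target_host: str):
    with open(log_path, errors="replace") as f:
        for line in f:
            m = DISPATCH_RE.match(line)
            if m and m.group("host") == target_host:
                yield m.group("ts"), int(m.group("seq")), int(m.group("corr"))

if __name__ == "__main__":
    # e.g. python dispatches.py rp-3.log ducktape-node-14-nationally-ultimate-lamb
    for ts, seq, corr in dispatches(sys.argv[1], sys.argv[2]):
        print(f"{ts} seq={seq} corr={corr}")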
Do you mean the RPCs failed? Is that possible to determine from the log above?
Kind of; it's a bit odd. There's not a lot of logging around RPCs for several (~20) seconds other than those "Dispatched request with sequence" lines from rp-3 dispatching to rp-14. On rp-14, there doesn't look to be anything too suspicious related to rp-3 RPCs over the span of those 20 seconds (a rough gap-measuring sketch follows the excerpt below):
TRACE 2024-06-04 10:57:24,082 [shard 0:main] rpc - transport.cc:358 - Dispatched request with sequence: 1, correlation_idx: 1, pending queue_size: 0, target_address: {host: ducktape-node-3-nationally-ultimate-lamb, port: 33145}
INFO 2024-06-04 10:57:24,082 [shard 0:main] cluster - config_manager.cc:597 - reconcile_status: failed sending status update to leader: rpc::errc::missing_node_rpc_client
DEBUG 2024-06-04 10:57:24,093 [shard 0:main] rpc - transport.cc:400 - could not parse header from server: {host: ducktape-node-3-nationally-ultimate-lamb, port: 33145}
DEBUG 2024-06-04 10:57:24,093 [shard 0:main] rpc - transport.cc:471 - RPC Client to {host: ducktape-node-3-nationally-ultimate-lamb, port: 33145} probes: { requests_sent: 1, requests_pending: 0, requests_completed: 1, request_errors: 0, request_timeouts: 0, in_bytes: 2113, out_bytes: 412, connects: 1, connections: 0, connection_errors: 0, read_dispatch_errors: 0, corrupted_headers: 1, server_correlation_errors: 0, client_correlation_errors: 0, requests_blocked_memory: 0 }
TRACE 2024-06-04 10:57:24,107 [shard 0:main] rpc - server.cc:273 - vectorized internal rpc protocol - Incoming connection from 10.168.0.90:51456 on ""
TRACE 2024-06-04 10:57:24,178 [shard 0:main] rpc - server.cc:273 - vectorized internal rpc protocol - Incoming connection from 10.168.0.25:61476 on ""
TRACE 2024-06-04 10:57:25,836 [shard 0:main] rpc - server.cc:273 - vectorized internal rpc protocol - Incoming connection from 10.168.0.109:52216 on ""
TRACE 2024-06-04 10:57:25,851 [shard 0:main] rpc - server.cc:273 - vectorized internal rpc protocol - Incoming connection from 10.168.0.90:64980 on ""
TRACE 2024-06-04 10:57:25,861 [shard 3:main] rpc - server.cc:273 - vectorized internal rpc protocol - Incoming connection from 10.168.0.90:54495 on ""
TRACE 2024-06-04 10:57:25,864 [shard 2:main] rpc - server.cc:273 - vectorized internal rpc protocol - Incoming connection from 10.168.0.25:52510 on ""
TRACE 2024-06-04 10:57:25,869 [shard 3:main] rpc - server.cc:273 - vectorized internal rpc protocol - Incoming connection from 10.168.0.25:57963 on ""
TRACE 2024-06-04 10:57:25,888 [shard 1:main] rpc - server.cc:273 - vectorized internal rpc protocol - Incoming connection from 10.168.0.90:50089 on ""
TRACE 2024-06-04 10:57:25,897 [shard 1:main] rpc - server.cc:273 - vectorized internal rpc protocol - Incoming connection from 10.168.0.25:54141 on ""
TRACE 2024-06-04 10:57:25,913 [shard 0:main] rpc - server.cc:273 - vectorized internal rpc protocol - Incoming connection from 10.168.0.25:62664 on ""
TRACE 2024-06-04 10:57:25,923 [shard 3:main] rpc - server.cc:273 - vectorized internal rpc protocol - Incoming connection from 10.168.0.109:57475 on ""
TRACE 2024-06-04 10:57:25,933 [shard 1:main] rpc - server.cc:273 - vectorized internal rpc protocol - Incoming connection from 10.168.0.109:50909 on ""
TRACE 2024-06-04 10:57:25,948 [shard 2:main] rpc - server.cc:273 - vectorized internal rpc protocol - Incoming connection from 10.168.0.109:58958 on ""
TRACE 2024-06-04 10:57:25,990 [shard 2:main] rpc - server.cc:273 - vectorized internal rpc protocol - Incoming connection from 10.168.0.90:63066 on ""
INFO 2024-06-04 10:57:29,082 [shard 0:main] cluster - config_manager.cc:597 - reconcile_status: failed sending status update to leader: rpc::errc::missing_node_rpc_client
INFO 2024-06-04 10:57:34,082 [shard 0:main] cluster - config_manager.cc:597 - reconcile_status: failed sending status update to leader: rpc::errc::missing_node_rpc_client
INFO 2024-06-04 10:57:39,082 [shard 0:main] cluster - config_manager.cc:597 - reconcile_status: failed sending status update to leader: rpc::errc::missing_node_rpc_client
DEBUG 2024-06-04 10:57:42,624 [shard 0:main] rpc - could not parse header from client: 10.168.0.109:52216
TRACE 2024-06-04 10:57:42,624 [shard 0:main] rpc - server.cc:178 - shutting down connection 10.168.0.109:52216
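To back up the "not a lot of RPC logging for ~20 seconds" observation, something like this ad-hoc script (again assuming only the timestamp format shown in the excerpts, and not part of rptest) can flag the quiet window in a node's log:

# Rough gap finder: scans the timestamps of "rpc -" lines in a node log and
# reports any gap larger than a threshold. Timestamp format is taken from the
# excerpts above.
from datetime import datetime
import sys

def rpc_log_gaps(log_path: str, threshold_s: float = 5.0):
    prev = None
    with open(log_path, errors="replace") as f:
        for line in f:
            if " rpc - " not in line:
                continue
            parts = line.split()
            if len(parts) < 3:
                continue
            try:
                # e.g. "2024-06-04 10:57:25,990"
                ts = datetime.strptime(f"{parts[1]} {parts[2]}",
                                       "%Y-%m-%d %H:%M:%S,%f")
            except ValueError:
                continue
            if prev is not None and (ts - prev).total_seconds() > threshold_s:
                print(f"{(ts - prev).total_seconds():.1f}s gap "
                      f"ending at {parts[1]} {parts[2]}")
            prev = ts

if __name__ == "__main__":
    rpc_log_gaps(sys.argv[1])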
Closing older-bot-filed CI issues as we transition to a more reliable system.