Umbrella issue for "redpanda failed to start in 20 sec"
There are several CI-failure issues where Redpanda failed to start within 20 seconds, and I believe they share the same underlying cause.
This is an umbrella issue for them so we have one stream of investigation. We can't close the other issues because of how PT works, but we can point them here.
A typical example:
TimeoutError('Redpanda service docker-rp-7 failed to start within 20 sec')
Traceback (most recent call last):
File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 184, in _do_run
data = self.run_test()
File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 276, in run_test
return self.test_context.function(self.test)
File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/mark/_mark.py", line 535, in wrapper
return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
File "/root/tests/rptest/services/cluster.py", line 103, in wrapped
r = f(self, *args, **kwargs)
File "/root/tests/rptest/tests/partition_movement_test.py", line 971, in test_shadow_indexing
self.start_redpanda(num_nodes=3, install_opts=install_opts)
File "/root/tests/rptest/tests/end_to_end.py", line 125, in start_redpanda
self.redpanda.start(auto_assign_node_id=new_bootstrap,
File "/root/tests/rptest/services/redpanda.py", line 2552, in start
raise e
File "/root/tests/rptest/services/redpanda.py", line 2544, in start
self.for_nodes(to_start, start_one)
File "/root/tests/rptest/services/redpanda.py", line 1391, in for_nodes
return list(executor.map(cb, nodes))
File "/usr/lib/python3.10/concurrent/futures/_base.py", line 621, in result_iterator
yield _result_or_cancel(fs.pop())
File "/usr/lib/python3.10/concurrent/futures/_base.py", line 319, in _result_or_cancel
return fut.result(timeout)
File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result
return self.__get_result()
File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
raise self._exception
File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/root/tests/rptest/services/redpanda.py", line 2536, in start_one
self.start_node(node,
File "/root/tests/rptest/services/redpanda.py", line 2871, in start_node
self.start_service(node, start_rp)
ducktape.errors.TimeoutError: Redpanda service docker-rp-7 failed to start within 20 sec
Test requested 5 nodes, used only 3
JIRA Link: CORE-4276
Maybe we add a call to https://github.com/redpanda-data/redpanda/blob/5f075743fdc2659d599d44425f738432eb8d8fd2/tests/rptest/services/redpanda.py#L1548-L1570 on nodes that fail to start within 20 seconds? We could at least start with that, and then hopefully further distinguish the reason for the 20-second timeout.
That probably won't catch some of the above failures, but it would have caught the failure in #18845, since that node asserted at startup.
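A minimal sketch of what that wiring might look like, assuming the helper at the linked lines can be invoked per node; `collect_start_failure_diagnostics` is a placeholder name for whatever that method is actually called, not the real API:

# Hedged sketch only -- not the current redpanda.py code. It assumes the helper
# at the linked lines is callable per node; `collect_start_failure_diagnostics`
# is a placeholder for its actual name.
from ducktape.errors import TimeoutError as DucktapeTimeoutError

def start_one(self, node, **kwargs):
    try:
        self.start_node(node, **kwargs)
    except DucktapeTimeoutError:
        try:
            # Best effort: grab logs/backtraces before re-raising so a
            # 20-second timeout leaves something to distinguish it by.
            self.collect_start_failure_diagnostics(node)  # hypothetical name
        except Exception as e:
            self.logger.warning(
                f"diagnostics collection failed on {node.account.hostname}: {e}")
        raise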
Just noting that many/most of the linked issues have had their artifacts aged off.
I briefly spent some time looking at https://github.com/redpanda-data/redpanda/issues/18793 which still has logs.
In this instance, it looks like rp-14 successfully registers as node id 5, but its controller doesn't get replicated to for 20 seconds, so the test times out. rp-3 is the leader at this time (rp-n here is shorthand for ducktape-node-n-nationally-ultimate-lamb). On that node, it appears to be trying to send RPCs:
TRACE 2024-06-04 10:57:24,379 [shard 0:main] rpc - transport.cc:358 - Dispatched request with sequence: 3, correlation_idx: 3, pending queue_size: 0, target_address: {host: ducktape-node-14-nationally-ultimate-lamb, port: 33145}
TRACE 2024-06-04 10:57:24,479 [shard 0:main] rpc - transport.cc:358 - Dispatched request with sequence: 4, correlation_idx: 4, pending queue_size: 0, target_address: {host: ducktape-node-14-nationally-ultimate-lamb, port: 33145}
TRACE 2024-06-04 10:57:24,579 [shard 0:main] rpc - transport.cc:358 - Dispatched request with sequence: 5, correlation_idx: 5, pending queue_size: 0, target_address: {host: ducktape-node-14-nationally-ultimate-lamb, port: 33145}
TRACE 2024-06-04 10:57:24,679 [shard 0:main] rpc - transport.cc:358 - Dispatched request with sequence: 6, correlation_idx: 6, pending queue_size: 0, target_address: {host: ducktape-node-14-nationally-ultimate-lamb, port: 33145}
TRACE 2024-06-04 10:57:24,780 [shard 0:main] rpc - transport.cc:358 - Dispatched request with sequence: 7, correlation_idx: 7, pending queue_size: 0, target_address: {host: ducktape-node-14-nationally-ultimate-lamb, port: 33145}
TRACE 2024-06-04 10:57:24,880 [shard 0:main] rpc - transport.cc:358 - Dispatched request with sequence: 8, correlation_idx: 8, pending queue_size: 0, target_address: {host: ducktape-node-14-nationally-ultimate-lamb, port: 33145}
TRACE 2024-06-04 10:57:24,980 [shard 0:main] rpc - transport.cc:358 - Dispatched request with sequence: 9, correlation_idx: 9, pending queue_size: 0, target_address: {host: ducktape-node-14-nationally-ultimate-lamb, port: 33145}
but it doesn't seem to take.
https://ci-artifacts.dev.vectorized.cloud/vtools/14337/018fe244-7bf3-4bd5-b6ff-43ccc1a311d4/vbuild/ducktape/results/2024-06-04--001/report.html
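For triage, a throwaway script along these lines can pull the dispatch cadence out of the leader's log; the regex is based only on the log-line format quoted above, and the log path / hostname arguments are whatever you download from the CI artifacts:

# Ad-hoc triage script, not part of rptest: prints the "Dispatched request"
# TRACE lines aimed at a given host, so the retry cadence (the ~100 ms spacing
# above) is easy to eyeball.
import re
import sys

DISPATCH_RE = re.compile(
    r"^TRACE (?P<ts>\S+ \S+) .*Dispatched request with sequence: (?P<seq>\d+), "
    r"correlation_idx: (?P<corr>\d+).*target_address: \{host: (?P<host>[^,]+),"
)

def dispatches(log_path: str, target_host: str):
    with open(log_path, errors="replace") as f:
        for line in f:
            m = DISPATCH_RE.match(line)
            if m and m.group("host") == target_host:
                yield m.group("ts"), int(m.group("seq")), int(m.group("corr"))

if __name__ == "__main__":
    # e.g. python dispatches.py rp-3.log ducktape-node-14-nationally-ultimate-lamb
    for ts, seq, corr in dispatches(sys.argv[1], sys.argv[2]):
        print(f"{ts} seq={seq} corr={corr}")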
Do you mean the RPCs failed? Is that possible to determine from the log above?
Kind of; it's a bit odd. There's not a lot of logging around RPCs for several (~20) seconds other than those "Dispatched request with sequence" lines from rp-3 dispatching to rp-14. On rp-14, there doesn't look to be anything too suspicious related to rp-3 RPCs over the span of those 20 seconds (a rough gap-measuring sketch follows the excerpt below):
TRACE 2024-06-04 10:57:24,082 [shard 0:main] rpc - transport.cc:358 - Dispatched request with sequence: 1, correlation_idx: 1, pending queue_size: 0, target_address: {host: ducktape-node-3-nationally-ultimate-lamb, port: 33145}
INFO 2024-06-04 10:57:24,082 [shard 0:main] cluster - config_manager.cc:597 - reconcile_status: failed sending status update to leader: rpc::errc::missing_node_rpc_client
DEBUG 2024-06-04 10:57:24,093 [shard 0:main] rpc - transport.cc:400 - could not parse header from server: {host: ducktape-node-3-nationally-ultimate-lamb, port: 33145}
DEBUG 2024-06-04 10:57:24,093 [shard 0:main] rpc - transport.cc:471 - RPC Client to {host: ducktape-node-3-nationally-ultimate-lamb, port: 33145} probes: { requests_sent: 1, requests_pending: 0, requests_completed: 1, request_errors: 0, request_timeouts: 0, in_bytes: 2113, out_bytes: 412, connects: 1, connections: 0, connection_errors: 0, read_dispatch_errors: 0, corrupted_headers: 1, server_correlation_errors: 0, client_correlation_errors: 0, requests_blocked_memory: 0 }
TRACE 2024-06-04 10:57:24,107 [shard 0:main] rpc - server.cc:273 - vectorized internal rpc protocol - Incoming connection from 10.168.0.90:51456 on ""
TRACE 2024-06-04 10:57:24,178 [shard 0:main] rpc - server.cc:273 - vectorized internal rpc protocol - Incoming connection from 10.168.0.25:61476 on ""
TRACE 2024-06-04 10:57:25,836 [shard 0:main] rpc - server.cc:273 - vectorized internal rpc protocol - Incoming connection from 10.168.0.109:52216 on ""
TRACE 2024-06-04 10:57:25,851 [shard 0:main] rpc - server.cc:273 - vectorized internal rpc protocol - Incoming connection from 10.168.0.90:64980 on ""
TRACE 2024-06-04 10:57:25,861 [shard 3:main] rpc - server.cc:273 - vectorized internal rpc protocol - Incoming connection from 10.168.0.90:54495 on ""
TRACE 2024-06-04 10:57:25,864 [shard 2:main] rpc - server.cc:273 - vectorized internal rpc protocol - Incoming connection from 10.168.0.25:52510 on ""
TRACE 2024-06-04 10:57:25,869 [shard 3:main] rpc - server.cc:273 - vectorized internal rpc protocol - Incoming connection from 10.168.0.25:57963 on ""
TRACE 2024-06-04 10:57:25,888 [shard 1:main] rpc - server.cc:273 - vectorized internal rpc protocol - Incoming connection from 10.168.0.90:50089 on ""
TRACE 2024-06-04 10:57:25,897 [shard 1:main] rpc - server.cc:273 - vectorized internal rpc protocol - Incoming connection from 10.168.0.25:54141 on ""
TRACE 2024-06-04 10:57:25,913 [shard 0:main] rpc - server.cc:273 - vectorized internal rpc protocol - Incoming connection from 10.168.0.25:62664 on ""
TRACE 2024-06-04 10:57:25,923 [shard 3:main] rpc - server.cc:273 - vectorized internal rpc protocol - Incoming connection from 10.168.0.109:57475 on ""
TRACE 2024-06-04 10:57:25,933 [shard 1:main] rpc - server.cc:273 - vectorized internal rpc protocol - Incoming connection from 10.168.0.109:50909 on ""
TRACE 2024-06-04 10:57:25,948 [shard 2:main] rpc - server.cc:273 - vectorized internal rpc protocol - Incoming connection from 10.168.0.109:58958 on ""
TRACE 2024-06-04 10:57:25,990 [shard 2:main] rpc - server.cc:273 - vectorized internal rpc protocol - Incoming connection from 10.168.0.90:63066 on ""
INFO 2024-06-04 10:57:29,082 [shard 0:main] cluster - config_manager.cc:597 - reconcile_status: failed sending status update to leader: rpc::errc::missing_node_rpc_client
INFO 2024-06-04 10:57:34,082 [shard 0:main] cluster - config_manager.cc:597 - reconcile_status: failed sending status update to leader: rpc::errc::missing_node_rpc_client
INFO 2024-06-04 10:57:39,082 [shard 0:main] cluster - config_manager.cc:597 - reconcile_status: failed sending status update to leader: rpc::errc::missing_node_rpc_client
DEBUG 2024-06-04 10:57:42,624 [shard 0:main] rpc - could not parse header from client: 10.168.0.109:52216
TRACE 2024-06-04 10:57:42,624 [shard 0:main] rpc - server.cc:178 - shutting down connection 10.168.0.109:52216
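To back up the "not a lot of RPC logging for ~20 seconds" observation, something like this ad-hoc script (again assuming only the timestamp format shown in the excerpts, and not part of rptest) can flag the quiet window in a node's log:

# Rough gap finder: scans the timestamps of "rpc -" lines in a node log and
# reports any gap larger than a threshold. Timestamp format is taken from the
# excerpts above.
from datetime import datetime
import sys

def rpc_log_gaps(log_path: str, threshold_s: float = 5.0):
    prev = None
    with open(log_path, errors="replace") as f:
        for line in f:
            if " rpc - " not in line:
                continue
            parts = line.split()
            if len(parts) < 3:
                continue
            try:
                # e.g. "2024-06-04 10:57:25,990"
                ts = datetime.strptime(f"{parts[1]} {parts[2]}",
                                       "%Y-%m-%d %H:%M:%S,%f")
            except ValueError:
                continue
            if prev is not None and (ts - prev).total_seconds() > threshold_s:
                print(f"{(ts - prev).total_seconds():.1f}s gap "
                      f"ending at {parts[1]} {parts[2]}")
            prev = ts

if __name__ == "__main__":
    rpc_log_gaps(sys.argv[1])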
Closing older-bot-filed CI issues as we transition to a more reliable system.