CI Failure (CalledProcessError) in `MultiTopicAutomaticLeadershipBalancingTest.test_topic_aware_rebalance`
https://buildkite.com/redpanda/redpanda/builds/33178#01895497-d9b5-46bb-ac46-4958a3a26232
Module: rptest.tests.leadership_transfer_test
Class: MultiTopicAutomaticLeadershipBalancingTest
Method: test_topic_aware_rebalance
test_id: rptest.tests.leadership_transfer_test.MultiTopicAutomaticLeadershipBalancingTest.test_topic_aware_rebalance
status: FAIL
run time: 1 minute 28.049 seconds
CalledProcessError(1, ['/opt/kafka-3.0.0/bin/kafka-topics.sh', '--bootstrap-server', 'docker-rp-9:9092,docker-rp-21:9092,docker-rp-6:9092', '--create', '--topic', 'topic-jagnbeovwa', '--partitions', '263', '--replication-factor', '3', '--config', 'cleanup.policy=delete'])
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 133, in run
    self.setup_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 218, in setup_test
    self.test.setup()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/test.py", line 91, in setup
    self.setUp()
  File "/root/tests/rptest/tests/redpanda_test.py", line 111, in setUp
    self._create_initial_topics()
  File "/root/tests/rptest/tests/redpanda_test.py", line 123, in _create_initial_topics
    client.create_topic(spec)
  File "/root/tests/rptest/clients/kafka_cli_tools.py", line 84, in create_topic
    return self._run("kafka-topics.sh", args)
  File "/root/tests/rptest/clients/kafka_cli_tools.py", line 402, in _run
    return self._execute(cmd)
  File "/root/tests/rptest/clients/kafka_cli_tools.py", line 412, in _execute
    res = subprocess.check_output(cmd,
  File "/usr/lib/python3.10/subprocess.py", line 420, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib/python3.10/subprocess.py", line 524, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/opt/kafka-3.0.0/bin/kafka-topics.sh', '--bootstrap-server', 'docker-rp-9:9092,docker-rp-21:9092,docker-rp-6:9092', '--create', '--topic', 'topic-jagnbeovwa', '--partitions', '263', '--replication-factor', '3', '--config', 'cleanup.policy=delete']' returned non-zero exit status 1.
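For anyone trying to reproduce the failing step by hand, here is a minimal sketch (assumptions: a node with the Kafka CLI tools installed and the same brokers reachable; the command is copied verbatim from the error above) that re-runs the create and prints both output streams:

import subprocess

# Command copied from the CalledProcessError above; adjust the bootstrap
# servers and topic name for your environment.
cmd = [
    "/opt/kafka-3.0.0/bin/kafka-topics.sh",
    "--bootstrap-server", "docker-rp-9:9092,docker-rp-21:9092,docker-rp-6:9092",
    "--create", "--topic", "topic-jagnbeovwa",
    "--partitions", "263", "--replication-factor", "3",
    "--config", "cleanup.policy=delete",
]

# capture_output keeps stdout and stderr so the admin client's error
# (e.g. the TimeoutException shown in the log excerpt below) is visible.
res = subprocess.run(cmd, capture_output=True, text=True)
print("exit status:", res.returncode)
print(res.stdout)
print(res.stderr)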
JIRA Link: CORE-1372
Issue hasn't occurred for 2 months per Pandatriage; closing
Happened here:
https://buildkite.com/redpanda/redpanda/builds/44428#018d55e9-a12f-4d21-a4ba-37b09988b9d6
Here's the relevant part of the test log:
[DEBUG - 2024-01-29 16:58:58,694 - kafka_cli_tools - create_topic - lineno:102]: Creating topic: topic-vzgaqgtmlw
[DEBUG - 2024-01-29 16:58:58,694 - kafka_cli_tools - _execute - lineno:445]: Executing command: ['/opt/kafka-3.0.0/bin/kafka-topics.sh', '--bootstrap-server', 'docker-rp-12:9092,docker-rp-10:9092,docker-rp-11:9092', '--create', '--topic', 'topic-vzgaqgtmlw', '--partitions', '61', '--replication-factor', '3', '--config', 'cleanup.policy=delete']
[DEBUG - 2024-01-29 16:59:05,188 - kafka_cli_tools - _execute - lineno:450]: Created topic topic-vzgaqgtmlw.
[DEBUG - 2024-01-29 16:59:05,188 - redpanda_test - _create_initial_topics - lineno:81]: Creating initial topic topic-xctbtsvodn
[DEBUG - 2024-01-29 16:59:05,188 - kafka_cli_tools - create_topic - lineno:102]: Creating topic: topic-xctbtsvodn
[DEBUG - 2024-01-29 16:59:05,189 - kafka_cli_tools - _execute - lineno:445]: Executing command: ['/opt/kafka-3.0.0/bin/kafka-topics.sh', '--bootstrap-server', 'docker-rp-10:9092,docker-rp-11:9092,docker-rp-12:9092', '--create', '--topic', 'topic-xctbtsvodn', '--partitions', '151', '--replication-factor', '3', '--config', 'cleanup.policy=delete']
[DEBUG - 2024-01-29 16:59:24,432 - kafka_cli_tools - _execute - lineno:450]: Created topic topic-xctbtsvodn.
[DEBUG - 2024-01-29 16:59:24,432 - redpanda_test - _create_initial_topics - lineno:81]: Creating initial topic topic-hhftbmhjkt
[DEBUG - 2024-01-29 16:59:24,432 - kafka_cli_tools - create_topic - lineno:102]: Creating topic: topic-hhftbmhjkt
[DEBUG - 2024-01-29 16:59:24,432 - kafka_cli_tools - _execute - lineno:445]: Executing command: ['/opt/kafka-3.0.0/bin/kafka-topics.sh', '--bootstrap-server', 'docker-rp-10:9092,docker-rp-11:9092,docker-rp-12:9092', '--create', '--topic', 'topic-hhftbmhjkt', '--partitions', '263', '--replication-factor', '3', '--config', 'cleanup.policy=delete']
[DEBUG - 2024-01-29 17:00:25,816 - kafka_cli_tools - _execute - lineno:453]: Error (1) executing command: Error while executing topic command : Call(callName=createTopics, deadlineMs=1706547625424, tries=1, nextAllowedTryMs=1706547625578) timed out at 1706547625478 after 1 attempt(s)
[2024-01-29 17:00:25,482] ERROR org.apache.kafka.common.errors.TimeoutException: Call(callName=createTopics, deadlineMs=1706547625424, tries=1, nextAllowedTryMs=1706547625578) timed out at 1706547625478 after 1 attempt(s)
Caused by: org.apache.kafka.common.errors.DisconnectException: Cancelled createTopics request with correlation id 3 due to node 1 being disconnected
(kafka.admin.TopicCommand$)
It times out after 1 minute while creating the topic. I believe the "Cancelled createTopics request with correlation id 3 due to node 1 being disconnected" message does not mean the call failed because a node (broker) disconnected; rather, the timeout fired first, causing the client to disconnect, and this second message is a side-effect of that.
We can also see that the earlier calls to create topics with fewer partitions were slow: creating 151 partitions took about 20 seconds, and at roughly a minute for 263 partitions the per-partition time is less than twice as slow, so it seems entirely plausible that this is just slow.
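A quick back-of-the-envelope check with the timestamps from the log above bears that out (numbers are approximate, read off the DEBUG lines):

# Approximate per-partition creation times, from the log timestamps above.
t_per_151 = 19.2 / 151   # 16:59:05 -> 16:59:24 for the 151-partition topic, ~0.13 s/partition
t_per_263 = 61.0 / 263   # create issued 16:59:24, client gave up ~17:00:25, ~0.23 s/partition

# Ratio is ~1.8: slower per partition, but still less than twice as slow.
print(t_per_263 / t_per_151)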
Based on the logs, the nodes are in a bad state, perhaps especially rp-12, whose log is full of RPC timeouts. The client timeout occurs after 1 minute (as expected), and we probably had not successfully created the topic by that point.
Perhaps it is just too many partitions for the debug build cluster.
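The 1-minute deadline matches the Kafka admin client's default default.api.timeout.ms of 60000 ms. To separate "genuinely stuck" from "just slow", one option would be to pass kafka-topics.sh a properties file via its --command-config flag with a larger deadline and see whether the create eventually succeeds. A rough sketch (whether the rptest kafka_cli_tools wrapper exposes this is a separate question; the values below are arbitrary):

import subprocess, tempfile

# Hypothetical diagnostic run: raise the admin client deadline well past the
# default 60 s so a slow debug-build cluster has time to finish the create.
with tempfile.NamedTemporaryFile("w", suffix=".properties", delete=False) as f:
    f.write("default.api.timeout.ms=300000\n")
    f.write("request.timeout.ms=300000\n")
    props_path = f.name

cmd = [
    "/opt/kafka-3.0.0/bin/kafka-topics.sh",
    "--bootstrap-server", "docker-rp-10:9092,docker-rp-11:9092,docker-rp-12:9092",
    "--command-config", props_path,
    "--create", "--topic", "topic-hhftbmhjkt",
    "--partitions", "263", "--replication-factor", "3",
    "--config", "cleanup.policy=delete",
]
subprocess.run(cmd, check=True)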
This log line shows up:
TRACE 2024-01-29 16:59:55,735 [shard 0:main] cluster - controller_api.cc:176 - getting reconciliation state for {kafka/topic-hhftbmhjkt/262}
This means we got at least as far as the wait_for_topics part of topic creation. Then either the api.wait_for_topic call hung (the trace line appears only once for each partition and does not repeat, but it would be retried every 100ms if it returned a not-ready status), or we made it past that to return wait_for_leaders(md_cache, results, timeout), which may not have completed before the timeout. There is insufficient logging to determine which.
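For readers who don't know that code path, the control flow being described is roughly the following. This is a simplified, Python-flavoured illustration of the behaviour, not the actual C++ in controller_api.cc; the names and structure are illustrative only:

import time

def wait_until_ready_then_for_leaders(partition_ready, leaders_elected,
                                      timeout_s, poll_s=0.1):
    # partition_ready and leaders_elected are callables returning True/False;
    # they stand in for the reconciliation check (api.wait_for_topic) and the
    # leader wait (wait_for_leaders) discussed above.
    deadline = time.monotonic() + timeout_s

    # Phase 1: the "getting reconciliation state" trace corresponds to this
    # check; a not-ready result is retried every 100 ms until the deadline.
    while not partition_ready():
        if time.monotonic() > deadline:
            raise TimeoutError("reconciliation did not complete in time")
        time.sleep(poll_s)

    # Phase 2: wait for leader election; on a slow debug-build cluster this
    # can also outlive the client's 60 s deadline.
    while not leaders_elected():
        if time.monotonic() > deadline:
            raise TimeoutError("leaders not elected in time")
        time.sleep(poll_s)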