redpanda icon indicating copy to clipboard operation
redpanda copied to clipboard

CI Failure (CalledProcessError) in `MultiTopicAutomaticLeadershipBalancingTest.test_topic_aware_rebalance`

Open BenPope opened this issue 2 years ago • 5 comments
trafficstars

https://buildkite.com/redpanda/redpanda/builds/33178#01895497-d9b5-46bb-ac46-4958a3a26232

Module: rptest.tests.leadership_transfer_test
Class:  MultiTopicAutomaticLeadershipBalancingTest
Method: test_topic_aware_rebalance
test_id:    rptest.tests.leadership_transfer_test.MultiTopicAutomaticLeadershipBalancingTest.test_topic_aware_rebalance
status:     FAIL
run time:   1 minute 28.049 seconds


    CalledProcessError(1, ['/opt/kafka-3.0.0/bin/kafka-topics.sh', '--bootstrap-server', 'docker-rp-9:9092,docker-rp-21:9092,docker-rp-6:9092', '--create', '--topic', 'topic-jagnbeovwa', '--partitions', '263', '--replication-factor', '3', '--config', 'cleanup.policy=delete'])
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 133, in run
    self.setup_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 218, in setup_test
    self.test.setup()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/test.py", line 91, in setup
    self.setUp()
  File "/root/tests/rptest/tests/redpanda_test.py", line 111, in setUp
    self._create_initial_topics()
  File "/root/tests/rptest/tests/redpanda_test.py", line 123, in _create_initial_topics
    client.create_topic(spec)
  File "/root/tests/rptest/clients/kafka_cli_tools.py", line 84, in create_topic
    return self._run("kafka-topics.sh", args)
  File "/root/tests/rptest/clients/kafka_cli_tools.py", line 402, in _run
    return self._execute(cmd)
  File "/root/tests/rptest/clients/kafka_cli_tools.py", line 412, in _execute
    res = subprocess.check_output(cmd,
  File "/usr/lib/python3.10/subprocess.py", line 420, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib/python3.10/subprocess.py", line 524, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/opt/kafka-3.0.0/bin/kafka-topics.sh', '--bootstrap-server', 'docker-rp-9:9092,docker-rp-21:9092,docker-rp-6:9092', '--create', '--topic', 'topic-jagnbeovwa', '--partitions', '263', '--replication-factor', '3', '--config', 'cleanup.policy=delete']' returned non-zero exit status 1.

JIRA Link: CORE-1372

BenPope avatar Jul 17 '23 09:07 BenPope

Issue hasn't occurred for 2 months per Pandatriage; closing

piyushredpanda avatar Sep 26 '23 05:09 piyushredpanda

Happened here:

https://buildkite.com/redpanda/redpanda/builds/44428#018d55e9-a12f-4d21-a4ba-37b09988b9d6

travisdowns avatar Jan 30 '24 00:01 travisdowns

Here's the relevant part of the test log:

[DEBUG - 2024-01-29 16:58:58,694 - kafka_cli_tools - create_topic - lineno:102]: Creating topic: topic-vzgaqgtmlw
[DEBUG - 2024-01-29 16:58:58,694 - kafka_cli_tools - _execute - lineno:445]: Executing command: ['/opt/kafka-3.0.0/bin/kafka-topics.sh', '--bootstrap-server', 'docker-rp-12:9092,docker-rp-10:9092,docker-rp-11:9092', '--create', '--topic', 'topic-vzgaqgtmlw', '--partitions', '61', '--replication-factor', '3', '--config', 'cleanup.policy=delete']
[DEBUG - 2024-01-29 16:59:05,188 - kafka_cli_tools - _execute - lineno:450]: Created topic topic-vzgaqgtmlw.

[DEBUG - 2024-01-29 16:59:05,188 - redpanda_test - _create_initial_topics - lineno:81]: Creating initial topic topic-xctbtsvodn
[DEBUG - 2024-01-29 16:59:05,188 - kafka_cli_tools - create_topic - lineno:102]: Creating topic: topic-xctbtsvodn
[DEBUG - 2024-01-29 16:59:05,189 - kafka_cli_tools - _execute - lineno:445]: Executing command: ['/opt/kafka-3.0.0/bin/kafka-topics.sh', '--bootstrap-server', 'docker-rp-10:9092,docker-rp-11:9092,docker-rp-12:9092', '--create', '--topic', 'topic-xctbtsvodn', '--partitions', '151', '--replication-factor', '3', '--config', 'cleanup.policy=delete']
[DEBUG - 2024-01-29 16:59:24,432 - kafka_cli_tools - _execute - lineno:450]: Created topic topic-xctbtsvodn.

[DEBUG - 2024-01-29 16:59:24,432 - redpanda_test - _create_initial_topics - lineno:81]: Creating initial topic topic-hhftbmhjkt
[DEBUG - 2024-01-29 16:59:24,432 - kafka_cli_tools - create_topic - lineno:102]: Creating topic: topic-hhftbmhjkt
[DEBUG - 2024-01-29 16:59:24,432 - kafka_cli_tools - _execute - lineno:445]: Executing command: ['/opt/kafka-3.0.0/bin/kafka-topics.sh', '--bootstrap-server', 'docker-rp-10:9092,docker-rp-11:9092,docker-rp-12:9092', '--create', '--topic', 'topic-hhftbmhjkt', '--partitions', '263', '--replication-factor', '3', '--config', 'cleanup.policy=delete']
[DEBUG - 2024-01-29 17:00:25,816 - kafka_cli_tools - _execute - lineno:453]: Error (1) executing command: Error while executing topic command : Call(callName=createTopics, deadlineMs=1706547625424, tries=1, nextAllowedTryMs=1706547625578) timed out at 1706547625478 after 1 attempt(s)
[2024-01-29 17:00:25,482] ERROR org.apache.kafka.common.errors.TimeoutException: Call(callName=createTopics, deadlineMs=1706547625424, tries=1, nextAllowedTryMs=1706547625578) timed out at 1706547625478 after 1 attempt(s)
Caused by: org.apache.kafka.common.errors.DisconnectException: Cancelled createTopics request with correlation id 3 due to node 1 being disconnected
 (kafka.admin.TopicCommand$)

It times out after 1 minute while creating the topic. I believe the Cancelled createTopics request with correlation id 3 due to node 1 being disconnected does not mean that the call failed because a node (broker) disconnected, but that the timeout happened, causing the client to disconnect and producing this second message as a side-effect.

We can see that the earlier calls to create topics with fewer numbers of partitions were also slow: creating 151 partitions took about 20 seconds, so it's taking ~1 minute to create 263 partitions would be less than twice as slow, so seems entirely plausible this is just slow.

travisdowns avatar Jan 30 '24 00:01 travisdowns

Based on the logs the nodes are in a bad state, perhaps especially rp-12: full of RPC timeouts. From the logs the client timeout occurs after 1 minute (as expected) and probably we have not successfully created the topic by that point.

Perhaps it is just too many partitions for the debug build cluster.

travisdowns avatar Jan 30 '24 01:01 travisdowns

This log line shows up:

TRACE 2024-01-29 16:59:55,735 [shard 0:main] cluster - controller_api.cc:176 - getting reconciliation state for {kafka/topic-hhftbmhjkt/262}

This means we got at least to the wait_for_topics part of the topic creation. Then either return api.wait_for_topic part hung (the log line appears once only for every partition and does not repeat, but we will repeat this every 100ms if it returns with a not-ready status), or we made it past that to return wait_for_leaders(md_cache, results, timeout), which may not have completed by the timeout. There is insufficient logging to determine which.

travisdowns avatar Jan 30 '24 02:01 travisdowns