CI Failure (swarm hangs) in `OMBValidationTest.test_max_connections`

Open vbotbuildovich opened this issue 1 year ago • 9 comments
https://buildkite.com/redpanda/vtools/builds/11039
https://buildkite.com/redpanda/vtools/builds/11043
https://buildkite.com/redpanda/vtools/builds/11133

Module: rptest.redpanda_cloud_tests.omb_validation_test
Class: OMBValidationTest
Method: test_max_connections
test_id:    OMBValidationTest.test_max_connections
status:     FAIL
run time:   1625.388 seconds

TimeoutError(None)
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 184, in _do_run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 269, in run_test
    return self.test_context.function(self.test)
  File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 82, in wrapped
    r = f(self, *args, **kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/redpanda_cloud_tests/omb_validation_test.py", line 281, in test_max_connections
    s.wait(timeout_sec=5 * 60)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/services/service.py", line 287, in wait
    if not self.wait_node(node, end - now):
  File "/home/ubuntu/redpanda/tests/rptest/services/producer_swarm.py", line 111, in wait_node
    self._redpanda.wait_until(lambda: not self.is_alive(node),
  File "/home/ubuntu/redpanda/tests/rptest/services/redpanda.py", line 1226, in wait_until
    wait_until(wrapped,
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 57, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError: None

JIRA Link: CORE-1634

vbotbuildovich avatar Dec 13 '23 20:12 vbotbuildovich

* https://buildkite.com/redpanda/vtools/builds/11412

vbotbuildovich avatar Jan 12 '24 00:01 vbotbuildovich

* https://buildkite.com/redpanda/vtools/builds/11575
* https://buildkite.com/redpanda/vtools/builds/11579

vbotbuildovich avatar Jan 27 '24 01:01 vbotbuildovich

* https://buildkite.com/redpanda/vtools/builds/11697

vbotbuildovich avatar Feb 04 '24 00:02 vbotbuildovich

* https://buildkite.com/redpanda/vtools/builds/12361

vbotbuildovich avatar Mar 20 '24 02:03 vbotbuildovich

* https://buildkite.com/redpanda/vtools/builds/13141

vbotbuildovich avatar Apr 24 '24 21:04 vbotbuildovich

* https://buildkite.com/redpanda/vtools/builds/13245

vbotbuildovich avatar Apr 27 '24 21:04 vbotbuildovich

* https://buildkite.com/redpanda/vtools/builds/13286

vbotbuildovich avatar Apr 29 '24 21:04 vbotbuildovich

The symptom is that one (or a few) producers appear to hang for 5 minutes while producing a message, then produce a few more messages, then hang again for 5 minutes, and so on. These producers would take hours to finish their load, and the swarm join times out long before that.

It is unclear why the producers fail in this way: there seem to be sufficient resources to run all the producers and there are no obvious errors from the Kafka layer (only the timeout error returned to Rust).

Example swarm log with 275 being the problematic producer:

[2024-04-23T23:00:43Z INFO  client_swarm::producers] Joining producer 274...
[2024-04-23T23:00:43Z INFO  client_swarm::producers] Joining producer 275...
[2024-04-23T23:01:06Z WARN  client_swarm::producers] Error on producer 275 87/660, producing 64 bytes, compression=none, compressible=false : Message production error: MessageTimedOut (Local: Message timed out)
[2024-04-23T23:01:06Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:01:06Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:01:07Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:01:08Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:01:09Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:01:10Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:01:11Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:01:12Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:01:13Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:01:14Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:01:15Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:01:16Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:01:17Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:01:18Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:01:19Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:01:20Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:01:21Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:01:22Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:01:23Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:01:24Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:01:25Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:01:26Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:01:27Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:01:28Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:01:29Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:01:30Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:01:31Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:01:32Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:01:33Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:06:33Z WARN  client_swarm::producers] Error on producer 275 116/660, producing 64 bytes, compression=none, compressible=false : Message production error: MessageTimedOut (Local: Message timed out)
[2024-04-23T23:06:33Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:06:33Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:06:34Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:06:35Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:06:36Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:06:37Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:06:38Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:06:39Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:06:40Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:06:41Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:06:42Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:06:43Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:06:44Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:06:45Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:06:46Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:06:47Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:06:48Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:06:49Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:06:50Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:06:51Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:06:52Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:06:53Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:06:54Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:06:55Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:06:56Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:06:57Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:06:58Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:06:59Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:07:00Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:07:01Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:07:02Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:07:03Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:12:03Z WARN  client_swarm::producers] Error on producer 275 148/660, producing 64 bytes, compression=none, compressible=false : Message production error: MessageTimedOut (Local: Message timed out)
[2024-04-23T23:12:03Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:12:03Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:12:04Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:12:05Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:12:06Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:12:07Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:17:07Z WARN  client_swarm::producers] Error on producer 275 154/660, producing 64 bytes, compression=none, compressible=false : Message production error: MessageTimedOut (Local: Message timed out)
[2024-04-23T23:17:07Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:17:07Z DEBUG client_swarm::producers] Producer 275 waiting
[2024-04-23T23:17:08Z DEBUG client_swarm::producers] Producer 275 waiting
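
To put a rough number on "hours", the timeout warnings in a log like the one above can be parsed and extrapolated. This is only a sketch: `parse_timeouts` and `estimate_remaining_seconds` are hypothetical helper names, not part of client-swarm or rptest, and the linear extrapolation assumes the producer keeps progressing at the average rate seen between its first and last timeout.

```python
import re
from datetime import datetime, timezone

# Matches the WARN lines in the swarm log above, e.g.
# "[...Z WARN  client_swarm::producers] Error on producer 275 87/660, ..."
ERROR_RE = re.compile(
    r"\[(?P<ts>\S+)Z\s+WARN\s+client_swarm::producers\] "
    r"Error on producer (?P<id>\d+) (?P<done>\d+)/(?P<total>\d+),"
)


def parse_timeouts(lines):
    """Return (timestamp, producer_id, done, total) for each MessageTimedOut warning."""
    events = []
    for line in lines:
        m = ERROR_RE.search(line)
        if m and "MessageTimedOut" in line:
            ts = datetime.strptime(m["ts"], "%Y-%m-%dT%H:%M:%S")
            ts = ts.replace(tzinfo=timezone.utc)
            events.append((ts, int(m["id"]), int(m["done"]), int(m["total"])))
    return events


def estimate_remaining_seconds(events):
    """Linearly extrapolate the time one producer still needs to finish.

    `events` must all belong to the same producer. Returns None when
    there are fewer than two events or no progress between them.
    """
    if len(events) < 2:
        return None
    t0, _, done0, _ = events[0]
    t1, _, done1, total = events[-1]
    elapsed = (t1 - t0).total_seconds()
    if elapsed <= 0 or done1 <= done0:
        return None
    rate = (done1 - done0) / elapsed  # messages per second, stalls included
    return (total - done1) / rate
```

Fed the four warnings for producer 275 above (87, 116, 148 and 154 of 660 messages over about 16 minutes), this extrapolates to roughly two more hours for that producer, well beyond the 5 minute `s.wait(timeout_sec=5 * 60)` in the test.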

travisdowns avatar Apr 30 '24 19:04 travisdowns

* https://buildkite.com/redpanda/vtools/builds/13444
* https://buildkite.com/redpanda/vtools/builds/13445

vbotbuildovich avatar May 03 '24 21:05 vbotbuildovich

* https://buildkite.com/redpanda/vtools/builds/14144

vbotbuildovich avatar May 29 '24 21:05 vbotbuildovich

* https://buildkite.com/redpanda/vtools/builds/14197

vbotbuildovich avatar May 31 '24 21:05 vbotbuildovich