Failure in `ManyClientsTest.test_many_clients`
Version & Environment
Redpanda version (use rpk version): dev
Link - https://buildkite.com/redpanda/vtools/builds/3894#018407f6-35fb-4132-82d9-eccf079f22d8/6-1930
The ProducerSwarm service fails to finish in time:
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/scale_tests/many_clients_test.py", line 86, in test_many_clients
    producer.wait()
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/services/background_thread.py", line 72, in wait
    super(BackgroundThreadService, self).wait(timeout_sec)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/services/service.py", line 267, in wait
    raise TimeoutError("Timed out waiting %s seconds for service nodes to finish. " % str(timeout_sec)
ducktape.errors.TimeoutError: Timed out waiting 600 seconds for service nodes to finish. These nodes are still alive: ['ProducerSwarm-0-139687700891296 node 1 on ip-172-31-15-155']
FAIL test: ManyClientsTest.test_many_clients (1/3 runs)
failure at 2022-10-31T09:05:46.627Z: TimeoutError("Timed out waiting 600 seconds for service nodes to finish. These nodes are still alive: ['ProducerSwarm-0-139756753660672 node 1 on ip-172-31-52-92']")
in job https://buildkite.com/redpanda/vtools/builds/3988#01842c02-76f0-4568-bd38-a875e7dbc419
Looks like it expects each produce to complete within 1s, which doesn't seem like much with 4000 producers: https://github.com/redpanda-data/client-swarm/blob/main/src/main.rs#L113
https://github.com/redpanda-data/client-swarm/pull/4
Came back in https://buildkite.com/redpanda/vtools/builds/4182#01845f80-c176-4aa4-b55c-816b614deec6/6-7936 Maybe we need a larger timeout?
5s still isn't very long, so I agree a longer timeout might be suitable. Worth checking the log.
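For intuition about what this per-message timeout governs (client-swarm itself is Rust; this Python confluent_kafka sketch is purely illustrative, and the exact mapping onto client-swarm's internals is an assumption):

```python
from confluent_kafka import Producer

# Illustrative only: librdkafka-style clients expose a comparable per-message
# limit via message.timeout.ms -- how long a single record may wait for
# acknowledgement before its delivery is failed.
producer = Producer({
    "bootstrap.servers": "localhost:9092",  # placeholder broker address
    "message.timeout.ms": 5000,             # the "5s" discussed above (was 1s)
})

def on_delivery(err, msg):
    if err is not None:
        print(f"delivery failed: {err}")

producer.produce("test-topic", b"payload", callback=on_delivery)
producer.flush()  # block until outstanding deliveries complete or time out
```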
The test was logging "Creating new segment" messages for 10 minutes. There are a few bursts of "Request timeout to" messages.
@jcsp any thoughts? You introduced the test here: https://github.com/redpanda-data/redpanda/pull/4097
The segment roll messages aren't necessarily worrying, although we did introduce the change to disable segment size jitter by default since this test was originally created.
The request timeouts are the sign that the nodes are overloaded and failing to commit raft messages fast enough. Somewhat surprising with only 100 partitions. I could buy the idea that this is related to doing tiny segments, so let's rule that out as a possible cause.
Usually, if we get request timeouts from severely overloaded nodes, I'd expect there to be lots in flight, but at 08:01:11 all three nodes start reporting request timeouts with only 1 RPC in flight to their peer.
This test does intentionally run the system with very little memory (768MB for 2 shards), but there's no bad_alloc or anything like that. There are also no reactor stalls.
Basically it's as if something messed with the network for 1-2 minutes, and then things went back to normal. The test's producer wait is the default 10 minutes, and it's writing about 60GB of data, so this is a little tight for i3en.xlarge nodes to handle in that time period: the ~2 minute interruption would be sufficient to make it exceed that timeout.
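A quick sanity check on that arithmetic, using the numbers from this thread:

```python
# 60 GB within the 600 s Service.wait budget needs ~100 MB/s sustained;
# losing ~120 s to the apparent network interruption pushes the required
# rate to ~125 MB/s, easy to miss on an already-busy i3en.xlarge.
data_mb = 60 * 1000   # ~60 GB of produced data
budget_s = 600        # default Service.wait timeout
stall_s = 120         # the ~2 minute interruption

print(data_mb / budget_s)              # 100.0 MB/s sustained
print(data_mb / (budget_s - stall_s))  # 125.0 MB/s after the stall
```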
I didn't mean to suggest they were worrying. Data is written over a span of minutes, while the timeout for producing is 5s (recently raised from 1s). Is 5s a realistic timeout?
Yes, it should be: the test is aiming to validate that redpanda copes gracefully with this client count -- it isn't meant to overwhelm it. The 5s timeout only applies to producing one single message; the overall job of producing the data has a 600-second timeout (the default timeout for Service.wait).
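To make the distinction concrete, a minimal sketch (the timeout_sec parameter name comes from the traceback above; the larger value is purely illustrative):

```python
# Two independent timeouts are in play:
#   1. client-swarm's per-message produce timeout (1s originally, 5s after
#      https://github.com/redpanda-data/client-swarm/pull/4), applied inside
#      the Rust binary to each individual send;
#   2. ducktape's Service.wait() overall timeout, defaulting to 600s -- this
#      is the one that fired in the traceback above.
producer.wait()                  # uses the 600s Service.wait default
producer.wait(timeout_sec=900)   # hypothetical: grant a larger overall budget
```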
Some next steps here could be:
- Rework producer_swarm.py to be a Service rather than a BackgroundThreadService, and start the remote executable in the background with a local log file -- that approach helped a lot with KgoVerifierService (see the sketch after this list).
- Run this by hand and investigate how we're ending up with RPC timeouts (may be of interest to @travisdowns @ballard26) under a relatively modest 4000-client count.
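A minimal sketch of the first suggestion, assuming ducktape's Service API; the attribute names for the binary and log paths are hypothetical:

```python
from ducktape.services.service import Service

class ProducerSwarm(Service):
    """Sketch: run the swarm binary detached on the remote node, writing to a
    log file there, instead of holding a ducktape background thread open."""

    def start_node(self, node):
        # self._binary and self._log are hypothetical; nohup + '&' detaches
        # the process so start_node returns immediately.
        node.account.ssh(f"nohup {self._binary} > {self._log} 2>&1 &")

    def wait_node(self, node, timeout_sec=None):
        # Hypothetical completion check: the process has exited.
        return not node.account.ssh_output(
            f"pgrep -f {self._binary} || true").strip()

    def stop_node(self, node):
        node.account.ssh(f"pkill -f {self._binary} || true", allow_fail=True)

    def clean_node(self, node):
        node.account.ssh(f"rm -f {self._log}", allow_fail=True)
```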
It could also be useful to modify the test to fail if it sees RPC timeouts in the logs, so that we fail any time the cluster appears overwhelmed, rather than hitting this indirect failure from the wait() timeout; a sketch of such a check follows.
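A sketch of that log check (the log path and message pattern are assumptions, based on the "Request timeout to" bursts noted earlier):

```python
def assert_no_rpc_timeouts(redpanda):
    """Hypothetical rptest helper: fail fast if any broker logged Raft RPC
    timeouts, rather than waiting for the indirect Service.wait() failure."""
    for node in redpanda.nodes:
        count = node.account.ssh_output(
            # Log location is an assumption; adjust to the cluster's layout.
            "grep -c 'Request timeout to' /var/lib/redpanda/redpanda.log"
            " || true").decode().strip()
        if count and int(count) > 0:
            raise RuntimeError(
                f"{count} RPC timeouts in redpanda log on "
                f"{node.account.hostname}")
```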
I see some BadLogLines errors here: https://buildkite.com/redpanda/vtools/builds/4961#01857415-de4d-45ef-82aa-fd1e1e438729/6-7554
https://buildkite.com/redpanda/vtools/builds/5399#0185dda8-ff9a-47c1-8b69-82b48d8ab625
FAIL test: ManyClientsTest.test_many_clients (1/2 runs)
failure at 2023-01-23T15:41:01.209Z: TimeoutError("Timed out waiting 600 seconds for service nodes to finish. These nodes are still alive: ['ProducerSwarm-0-139808240651184 node 1 on ip-172-31-11-147']")
Not repeating