Failure in `ManyClientsTest.test_many_clients`
Version & Environment
Redpanda version (use rpk version): dev
Link - https://buildkite.com/redpanda/vtools/builds/3894#018407f6-35fb-4132-82d9-eccf079f22d8/6-1930
The ProducerSwarm service fails to finish in time:
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/home/ubuntu/redpanda/tests/rptest/scale_tests/many_clients_test.py", line 86, in test_many_clients
    producer.wait()
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/services/background_thread.py", line 72, in wait
    super(BackgroundThreadService, self).wait(timeout_sec)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/services/service.py", line 267, in wait
    raise TimeoutError("Timed out waiting %s seconds for service nodes to finish. " % str(timeout_sec)
ducktape.errors.TimeoutError: Timed out waiting 600 seconds for service nodes to finish. These nodes are still alive: ['ProducerSwarm-0-139687700891296 node 1 on ip-172-31-15-155']
FAIL test: ManyClientsTest.test_many_clients (1/3 runs)
failure at 2022-10-31T09:05:46.627Z: TimeoutError("Timed out waiting 600 seconds for service nodes to finish. These nodes are still alive: ['ProducerSwarm-0-139756753660672 node 1 on ip-172-31-52-92']")
in job https://buildkite.com/redpanda/vtools/builds/3988#01842c02-76f0-4568-bd38-a875e7dbc419
Looks like it expects each produce to complete within 1s, which doesn't seem like much with 4000 producers: https://github.com/redpanda-data/client-swarm/blob/main/src/main.rs#L113
https://github.com/redpanda-data/client-swarm/pull/4
Came back in https://buildkite.com/redpanda/vtools/builds/4182#01845f80-c176-4aa4-b55c-816b614deec6/6-7936 Maybe we need a larger timeout?
5s still isn't very long, so I agree a longer timeout might be suitable. Worth checking the log.
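For intuition about what this per-message timeout governs (client-swarm itself is Rust; this Python confluent_kafka sketch is purely illustrative, and the exact mapping onto client-swarm's internals is an assumption):

```python
from confluent_kafka import Producer

# Illustrative only: librdkafka-style clients expose a comparable per-message
# limit via message.timeout.ms -- how long a single record may wait for
# acknowledgement before its delivery is failed.
producer = Producer({
    "bootstrap.servers": "localhost:9092",  # placeholder broker address
    "message.timeout.ms": 5000,             # the "5s" discussed above (was 1s)
})

def on_delivery(err, msg):
    if err is not None:
        print(f"delivery failed: {err}")

producer.produce("test-topic", b"payload", callback=on_delivery)
producer.flush()  # block until outstanding deliveries complete or time out
```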
The test was logging "Creating new segment" messages for 10 minutes. There are a few bursts of "Request timeout to" messages.
@jcsp any thoughts? You introduced the test here: https://github.com/redpanda-data/redpanda/pull/4097
The segment roll messages aren't necessarily worrying, although we did introduce the change to disable segment size jitter by default since this test was originally created.
The request timeouts are the sign that the nodes are overloaded and failing to commit raft messages fast enough. Somewhat surprising with only 100 partitions. I could buy the idea that this is related to doing tiny segments, so let's rule that out as a possible cause.
Usually, if we get request timeouts from severely overloaded nodes, I'd expect there to be lots in flight, but at 08:01:11 all three nodes start reporting request timeouts with only 1 RPC in flight to their peer.
This test does intentionally run the system with very little memory (768MB for 2 shards), but there's no bad_alloc or anything like that. There are also no reactor stalls.
Basically it's as if something messed with the network for 1-2 minutes, and then things went back to normal. The test's producer wait is the default 10 minutes, and it's writing about 60GB of data, so this is a little tight for i3en.xlarge nodes to handle in that time period: the ~2 minute interruption would be sufficient to make it exceed that timeout.
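A quick sanity check on that arithmetic, using the numbers from this thread:

```python
# 60 GB within the 600 s Service.wait budget needs ~100 MB/s sustained;
# losing ~120 s to the apparent network interruption pushes the required
# rate to ~125 MB/s, easy to miss on an already-busy i3en.xlarge.
data_mb = 60 * 1000   # ~60 GB of produced data
budget_s = 600        # default Service.wait timeout
stall_s = 120         # the ~2 minute interruption

print(data_mb / budget_s)              # 100.0 MB/s sustained
print(data_mb / (budget_s - stall_s))  # 125.0 MB/s after the stall
```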
I didn't mean to suggest they were worrying. Data is written over a span of minutes, while the timeout for producing is 5s (recently raised from 1s). Is 5s a realistic timeout?
Yes, it should be: the test is aiming to validate that redpanda copes gracefully with this client count -- it isn't meant to overwhelm it. The 5s timeout only applies to producing one single message; the overall job of producing the data has a 600-second timeout (the default timeout for Service.wait).
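To make the distinction concrete, a minimal sketch (the timeout_sec parameter name comes from the traceback above; the larger value is purely illustrative):

```python
# Two independent timeouts are in play:
#   1. client-swarm's per-message produce timeout (1s originally, 5s after
#      https://github.com/redpanda-data/client-swarm/pull/4), applied inside
#      the Rust binary to each individual send;
#   2. ducktape's Service.wait() overall timeout, defaulting to 600s -- this
#      is the one that fired in the traceback above.
producer.wait()                  # uses the 600s Service.wait default
producer.wait(timeout_sec=900)   # hypothetical: grant a larger overall budget
```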
Some next steps here could be:
- Rework producer_swarm.py to be a Service rather than a BackgroundThreadService, and start the remote executable in the background with a local log file -- that approach helped a lot with KgoVerifierService (see the sketch after this list).
- Run this by hand and investigate how we're ending up with RPC timeouts (may be of interest to @travisdowns @ballard26) under a relatively modest 4000-client count.
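A minimal sketch of the first suggestion, assuming ducktape's Service API; the attribute names for the binary and log paths are hypothetical:

```python
from ducktape.services.service import Service

class ProducerSwarm(Service):
    """Sketch: run the swarm binary detached on the remote node, writing to a
    log file there, instead of holding a ducktape background thread open."""

    def start_node(self, node):
        # self._binary and self._log are hypothetical; nohup + '&' detaches
        # the process so start_node returns immediately.
        node.account.ssh(f"nohup {self._binary} > {self._log} 2>&1 &")

    def wait_node(self, node, timeout_sec=None):
        # Hypothetical completion check: the process has exited.
        return not node.account.ssh_output(
            f"pgrep -f {self._binary} || true").strip()

    def stop_node(self, node):
        node.account.ssh(f"pkill -f {self._binary} || true", allow_fail=True)

    def clean_node(self, node):
        node.account.ssh(f"rm -f {self._log}", allow_fail=True)
```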
It could also be useful to modify the test to fail if it sees RPC timeouts in the logs, so that we fail any time the cluster appears overwhelmed, rather than hitting this indirect failure from the wait() timeout; a sketch of such a check follows.
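A sketch of that log check (the log path and message pattern are assumptions, based on the "Request timeout to" bursts noted earlier):

```python
def assert_no_rpc_timeouts(redpanda):
    """Hypothetical rptest helper: fail fast if any broker logged Raft RPC
    timeouts, rather than waiting for the indirect Service.wait() failure."""
    for node in redpanda.nodes:
        count = node.account.ssh_output(
            # Log location is an assumption; adjust to the cluster's layout.
            "grep -c 'Request timeout to' /var/lib/redpanda/redpanda.log"
            " || true").decode().strip()
        if count and int(count) > 0:
            raise RuntimeError(
                f"{count} RPC timeouts in redpanda log on "
                f"{node.account.hostname}")
```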
I see some BadLogLines errors here: https://buildkite.com/redpanda/vtools/builds/4961#01857415-de4d-45ef-82aa-fd1e1e438729/6-7554
https://buildkite.com/redpanda/vtools/builds/5399#0185dda8-ff9a-47c1-8b69-82b48d8ab625
FAIL test: ManyClientsTest.test_many_clients (1/2 runs)
failure at 2023-01-23T15:41:01.209Z: TimeoutError("Timed out waiting 600 seconds for service nodes to finish. These nodes are still alive: ['ProducerSwarm-0-139808240651184 node 1 on ip-172-31-11-147']")
Not repeating