BadLogLines triggering on OOM, obscuring the fact that RP crashed
Version & Environment
Redpanda version: tip of dev 2023/02/01
What went wrong?
CI failures caused by OOM trigger a BadLogLines failure, which obscures the "redpanda crashed" failure.
That is, redpanda crashed and emitted a diagnostic while crashing, but the test failure reports the ERROR-level diagnostic line as the problem.
What should have happened instead?
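The test should fail through the "redpanda crashed" flow, so that the crash is reported as the cause rather than the ERROR-level diagnostic line.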
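One way to get that ordering is to check for crashed nodes before scanning logs for ERROR lines. The following is only a rough sketch of the idea, not the rptest implementation: `check_nodes`, `NodeCrashed`, `OOM_MARKER`, and the stand-in `BadLogLines` class here are hypothetical names, and it assumes the caller already knows which nodes have a dead redpanda process.

```python
import re

# Hypothetical names throughout; this is not the rptest implementation.
ERROR_RE = re.compile(r"^ERROR\b")
# Marker taken from the log line in this report.
OOM_MARKER = "seastar_memory - Dumping seastar memory diagnostics"


class NodeCrashed(Exception):
    """Raised when a node's process is known to have exited unexpectedly."""


class BadLogLines(Exception):
    """Stand-in for the real exception: unexpected ERROR lines were found."""


def check_nodes(logs_by_node, crashed_nodes):
    """Report crashes first, and only then scan for bad log lines.

    logs_by_node: dict mapping node name -> list of log lines
    crashed_nodes: set of node names whose process is no longer running
    """
    if crashed_nodes:
        details = {}
        for node in sorted(crashed_nodes):
            lines = logs_by_node.get(node, [])
            # Tag the crash as OOM when the memory diagnostic is present.
            is_oom = any(OOM_MARKER in line for line in lines)
            details[node] = "oom" if is_oom else "unknown"
        raise NodeCrashed(details)

    # No crash: fall back to the usual ERROR-line check.
    bad = {}
    for node, lines in logs_by_node.items():
        errors = [line for line in lines if ERROR_RE.match(line)]
        if errors:
            bad[node] = errors
    if bad:
        raise BadLogLines(bad)
```

With this ordering, the log line from the failure below would surface as a crash tagged "oom" on that node instead of as a bad log line.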
How to reproduce the issue?
Example failure:
https://buildkite.com/redpanda/vtools/builds/5596#01860452-135a-476d-ad15-637b394ea7ad
test_id: rptest.scale_tests.many_clients_test.ManyClientsTest.test_many_clients
status: FAIL
run time: 6 minutes 9.384 seconds
<BadLogLines nodes=ip-172-31-12-188(2) example="ERROR 2023-01-31 01:48:15,746 [shard 0] seastar_memory - Dumping seastar memory diagnostics">
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 67, in wrapped
    self.redpanda.raise_on_bad_logs(allow_list=log_allow_list)
  File "/home/ubuntu/redpanda/tests/rptest/services/redpanda.py", line 1741, in raise_on_bad_logs
    raise BadLogLines(bad_lines)
rptest.services.utils.BadLogLines: <BadLogLines nodes=ip-172-31-12-188(2) example="ERROR 2023-01-31 01:48:15,746 [shard 0] seastar_memory - Dumping seastar memory diagnostics">
JIRA Link: CORE-1157