BadLogLines triggering on OOM, obscuring the fact that RP crashed

Open travisdowns opened this issue 2 years ago • 3 comments

Version & Environment

Redpanda version: tip of dev 2023/02/01

What went wrong?

CI failures caused by OOM trigger a BadLogLines failure, which obscures the "redpanda crashed" failure.

That is, redpanda crashed and emitted a diagnostic at ERROR level while crashing, but the test failure reports that diagnostic line as the problem rather than the crash itself.

What should have happened instead?

The "redpanda crashed" failure flow should occur, so that the test reports the crash rather than the ERROR-level log line.
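
One way to picture the desired ordering is the rough sketch below. It is not the rptest implementation: `NodeReport`, `NodeCrashed`, `BadLogLinesFound`, and `raise_on_failure` are hypothetical names invented for illustration, and the sketch assumes the post-test check already knows which nodes crashed. The only point it makes is that the crash check runs before the bad-log-line check, so diagnostics emitted during a crash cannot mask it.

```python
from dataclasses import dataclass, field


class NodeCrashed(Exception):
    """Hypothetical: raised when a node's process exited unexpectedly."""


class BadLogLinesFound(Exception):
    """Hypothetical stand-in for a bad-log-lines failure."""


@dataclass
class NodeReport:
    """Hypothetical per-node summary collected after a test run."""
    name: str
    crashed: bool = False
    bad_lines: list = field(default_factory=list)


def raise_on_failure(reports):
    # Check for crashes first, so a crash is never masked by the
    # ERROR-level diagnostics the process emitted while going down.
    crashed = [r.name for r in reports if r.crashed]
    if crashed:
        raise NodeCrashed("redpanda crashed on: " + ", ".join(crashed))

    # Only report bad log lines when no node crashed.
    bad = {r.name: r.bad_lines for r in reports if r.bad_lines}
    if bad:
        raise BadLogLinesFound(str(bad))
```

With this ordering, a node that both crashed and logged "Dumping seastar memory diagnostics" would surface as a crash, while a healthy node with an unexpected ERROR line would still fail on the log check.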

How to reproduce the issue?

Example failure:

https://buildkite.com/redpanda/vtools/builds/5596#01860452-135a-476d-ad15-637b394ea7ad

test_id:    rptest.scale_tests.many_clients_test.ManyClientsTest.test_many_clients
status:     FAIL
run time:   6 minutes 9.384 seconds


    <BadLogLines nodes=ip-172-31-12-188(2) example="ERROR 2023-01-31 01:48:15,746 [shard 0] seastar_memory - Dumping seastar memory diagnostics">
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/home/ubuntu/.local/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/home/ubuntu/redpanda/tests/rptest/services/cluster.py", line 67, in wrapped
    self.redpanda.raise_on_bad_logs(allow_list=log_allow_list)
  File "/home/ubuntu/redpanda/tests/rptest/services/redpanda.py", line 1741, in raise_on_bad_logs
    raise BadLogLines(bad_lines)
rptest.services.utils.BadLogLines: <BadLogLines nodes=ip-172-31-12-188(2) example="ERROR 2023-01-31 01:48:15,746 [shard 0] seastar_memory - Dumping seastar memory diagnostics">

JIRA Link: CORE-1157

travisdowns, Feb 01 '23 22:02