
Abort on out-of-memory in ducktape tests

Open travisdowns opened this issue 3 years ago • 7 comments

We now set --abort-on-seastar-bad-alloc for ducktape tests, which will cause the redpanda process to terminate with a diagnostic.
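For illustration, here is a minimal, generic C++ sketch of the idea behind the flag, using the standard new-handler rather than Seastar's own allocator hook (which sits behind --abort-on-seastar-bad-alloc). The effect being aimed for is the same: terminate with a diagnostic at the failing allocation site instead of throwing std::bad_alloc and unwinding far away from it.

```cpp
// Generic sketch only: not Seastar's implementation of the flag.
#include <cstddef>
#include <cstdlib>
#include <iostream>
#include <new>

int main() {
    std::set_new_handler([] {
        // Emit a diagnostic and abort so the failing allocation site is
        // preserved in the backtrace / core dump.
        std::cerr << "allocation failed: aborting for diagnostics\n";
        std::abort();
    });

    // An impossibly large request fails immediately, so the handler runs.
    std::size_t huge = std::size_t(1) << 62;
    void* p = ::operator new(huge);
    std::cout << p << '\n'; // never reached
}
```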

travisdowns avatar Aug 02 '22 22:08 travisdowns

logging configuration error: Unknown logger 'seastar_memory'. Use --help-loggers to list available loggers.

lol, oops, and no fair @dotnwat peeking at draft PRs ;).

This arg definitely works locally for the plain redpanda binary, so maybe there is something specific about the ducktape env that trips it up. Will dig in.

travisdowns avatar Aug 03 '22 17:08 travisdowns

Oh, I see: the problem is that this logger doesn't exist in the debug build, because that build uses the libc allocator and the code that creates the seastar_memory logger object isn't even compiled in. I think the fix is just to create the logger unconditionally, even if it logs nothing.
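Roughly the shape of that fix, as a sketch; the macro and variable names here are assumptions, not necessarily the exact ones used in Seastar or in the linked PR.

```cpp
// Sketch of the fix described above; names are illustrative assumptions.
#include <seastar/util/log.hh>

// Before (approximately): the logger only exists when Seastar's own
// allocator is compiled in, so a debug/libc-allocator build has no
// "seastar_memory" logger to configure and the logger setting fails.
//
//   #ifndef SEASTAR_DEFAULT_ALLOCATOR
//   static seastar::logger seastar_memory_logger("seastar_memory");
//   #endif

// After: declare the logger unconditionally so it is always registered,
// even if the allocator code that would write to it is compiled out.
static seastar::logger seastar_memory_logger("seastar_memory");
```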

travisdowns avatar Aug 03 '22 17:08 travisdowns

This change depends on:

https://github.com/redpanda-data/seastar/pull/30

travisdowns avatar Aug 03 '22 19:08 travisdowns

This change is ready for review, but needs this other change in vtools to be merged first to solve the debug build failure:

https://github.com/redpanda-data/vtools/pull/869

travisdowns avatar Aug 05 '22 04:08 travisdowns

This needs a clustered ducktape run before merging, to see which existing tests might be tripping bad_allocs routinely.

jcsp avatar Aug 05 '22 08:08 jcsp

> This needs a clustered ducktape run before merging to see which existing tests might be tripping bad_allocs routinely

@jcsp - seems reasonable, is that something I can get CI to do for me?

The only place where this would make things much worse is tests that do a single very large allocation, which itself causes a bad_alloc to be thrown even though the shard has lots of memory available. In that scenario the bad_alloc always comes from the same place, so the behavior should be more or less deterministic and it could be handled.
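To make that scenario concrete, here is a hypothetical sketch in plain C++ (not Redpanda's actual fragmented buffer types) of how a single huge contiguous request can be handled once the failing site is identified: split it into bounded fragments so no individual allocation is large enough to fail on its own.

```cpp
// Hypothetical illustration only, not Redpanda code.
#include <algorithm>
#include <cstddef>
#include <vector>

// One large contiguous request: the lone allocation that can throw even
// when the shard has plenty of non-contiguous memory free.
std::vector<char> make_contiguous(std::size_t total_bytes) {
    return std::vector<char>(total_bytes);
}

// The same capacity as a list of fixed-size fragments: each request stays
// small, so the allocator never needs one huge contiguous run.
std::vector<std::vector<char>> make_fragmented(std::size_t total_bytes,
                                               std::size_t fragment_bytes = 128 * 1024) {
    std::vector<std::vector<char>> fragments;
    for (std::size_t off = 0; off < total_bytes; off += fragment_bytes) {
        fragments.emplace_back(std::min(fragment_bytes, total_bytes - off));
    }
    return fragments;
}
```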

As for tests that routinely run out of memory on normally sized, modest allocations, I'd guess they must already be flapping, so this should be an improvement for them.

travisdowns avatar Aug 06 '22 21:08 travisdowns

> @jcsp - seems reasonable, is that something I can get CI to do for me?

Maybe; ask #help-devprod to write some instructions into the wiki if they haven't already.

However, I find it's usually better to run it directly yourself; that way, if something fails, you've got the cluster right there to play with, and a shell into machines that already have the logs on them, instead of having to download a big tarball.

jcsp avatar Aug 08 '22 10:08 jcsp

> lol, oops, and no fair @dotnwat peeking at draft PRs ;).

heh. it's just a fog of PRs

dotnwat avatar Aug 17 '22 23:08 dotnwat

This LGTM once we know whether it immediately fails on any of the nightly tests in scale_tests (also needs a rebase).

jcsp avatar Aug 18 '22 08:08 jcsp

I've rebased it and am working on a scale test run.

travisdowns avatar Aug 29 '22 22:08 travisdowns

@jcsp - I ran this change on 7 x i3en.xlarge nodes: it passed 16/19 scale tests and the remaining three did not run because they required 12 nodes. I also ran all the other nightly ec2 tests on this infra and did not find any failures related to this OOM change.

travisdowns avatar Aug 30 '22 22:08 travisdowns