Abort on out-of-memory in ducktape tests
We now set --abort-on-seastar-bad-alloc for ducktape tests, which will cause the redpanda process to terminate with a diagnostic.
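(If you want the same behavior when running a binary by hand outside ducktape, the rough programmatic equivalent is sketched below; this is from memory of the seastar API, so double-check `seastar/core/memory.hh` for the exact call.)

```cpp
// Sketch: roughly what the flag enables. With this set, a failed allocation
// dumps memory diagnostics and aborts the process instead of throwing
// std::bad_alloc. Verify against seastar/core/memory.hh.
#include <seastar/core/app-template.hh>
#include <seastar/core/future.hh>
#include <seastar/core/memory.hh>

int main(int argc, char** argv) {
    seastar::app_template app;
    return app.run(argc, argv, [] {
        seastar::memory::set_abort_on_allocation_failure(true);
        return seastar::make_ready_future<>();
    });
}
```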
`logging configuration error: Unknown logger 'seastar_memory'. Use --help-loggers to list available loggers.`
lol, oops, and no fair @dotnwat peeking at draft PRs ;).
This arg definitely works locally with the plain redpanda binary; maybe there is something specific about the ducktape env that trips it up. Will dig in.
Oh, I see: the problem is that this logger doesn't exist in the debug build, because that build uses the libc allocator and the code that creates the seastar_memory logger object isn't even compiled in. I think the fix is just to create the logger unconditionally, even if it logs nothing.
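Roughly the shape of the fix I have in mind on the seastar side (a sketch only, not the actual diff; the macro and names are from memory of memory.cc):

```cpp
// Sketch: declare the seastar_memory logger unconditionally so the logger
// name is always registered with the logging system, and keep only the
// allocator code that writes to it behind the existing allocator switch.
#include <seastar/util/log.hh>

namespace seastar::memory {

// Previously this was only compiled in when the seastar allocator is in use,
// which is why debug builds (libc allocator) reject the logger name.
seastar::logger seastar_memory_logger("seastar_memory");

#ifndef SEASTAR_DEFAULT_ALLOCATOR
// ... allocator implementation that actually logs through
// seastar_memory_logger stays here, unchanged ...
#endif

} // namespace seastar::memory
```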
This change depends on:
https://github.com/redpanda-data/seastar/pull/30
This change is ready for review, but needs this other change in vtools to be merged first to solve the debug build failure:
https://github.com/redpanda-data/vtools/pull/869
This needs a clustered ducktape run before merging to see which existing tests might be tripping bad_allocs routinely
> This needs a clustered ducktape run before merging to see which existing tests might be tripping bad_allocs routinely
@jcsp - seems reasonable, is that something I can get CI to do for me?
The only place where this would make things much worse is tests that do a single very large allocation which itself causes a bad_alloc to be thrown even though the shard has plenty of memory available. In that scenario the bad_alloc always comes from the same place, so the behavior should be more or less deterministic and it could be handled.
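Something like the sketch below is what I mean by "handled": the one known-huge allocation catches the failure and degrades rather than taking down the process (with the new flag on, that call site would also need to opt out of the abort for just that allocation; the sizes and names here are made up).

```cpp
// Illustrative only: a single, known-to-be-huge reservation whose failure is
// deterministic can fall back to a smaller size instead of propagating.
#include <cstddef>
#include <new>
#include <vector>

std::vector<char> make_buffer(std::size_t preferred, std::size_t fallback) {
    try {
        std::vector<char> buf;
        buf.reserve(preferred);  // the one very large allocation that may throw
        return buf;
    } catch (const std::bad_alloc&) {
        std::vector<char> buf;
        buf.reserve(fallback);   // degrade to a modest buffer instead
        return buf;
    }
}
```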
Tests that routinely run out of memory on modest, normally sized allocations must already be flapping, I'd guess, so this should be an improvement for those tests.
> @jcsp - seems reasonable, is that something I can get CI to do for me?
Maybe; ask #help-devprod to write some instructions into the wiki if they haven't already.
However, I find it's usually better to run it directly yourself; that way, if something fails you've got the cluster right there to play with, and you've got a shell into the machines that already have the logs on them instead of having to download a big tarball.
> lol, oops, and no fair @dotnwat peeking at draft PRs ;).
heh. it's just a fog of PRs
This LGTM once we know whether it immediately fails on any of the nightly tests in scale_tests (also needs a rebase).
I've rebased it and am working on a scale test run.
@jcsp - I ran this change on 7 x i3en.xlarge nodes: it passed 16/19 scale tests and the remaining three did not run because they required 12 nodes. I also ran all the other nightly ec2 tests on this infra and did not find any failures related to this OOM change.