neon icon indicating copy to clipboard operation
neon copied to clipboard

test suite: eliminate bug class "stray processes after test exits"

Open problame opened this issue 1 year ago • 4 comments

Problem

Some tests leave stay processes behind after they exit.

This is the potential root cause for failed coverage-report generation, as well as other flakiness.

DoD

The Python test suite ensures that after each test function exit, there are no stray subprocesses left.

If there are any, the processes' argv are listed at WARNING level and the test fails, preventing the PR from being merged.

Related Issues

Work

### Implementation
- [ ] https://github.com/neondatabase/neon/pull/6270#issuecomment-1878842496
- [ ] https://github.com/neondatabase/neon/pull/6497
- [ ] https://github.com/neondatabase/neon/pull/6470
- [ ] https://github.com/neondatabase/neon/issues/6487
- [ ] https://github.com/neondatabase/neon/issues/6486
- [ ] https://github.com/neondatabase/neon/issues/6489

problame avatar Jan 26 '24 10:01 problame

Update:

  • The cgroup approach bumped up against securiy concerns.
  • Putting on pause.

jcsp avatar Jan 29 '24 11:01 jcsp

Paused until https://neondb.slack.com/archives/C059ZC138NR/p1707229781663129?thread_ts=1706265241.134019&cid=C059ZC138NR is resolved

problame avatar Feb 06 '24 14:02 problame

An alternative I've been thinking about, much more hacky way:

  • make the pytest runner process a subprocess reaper
  • after each test report and kill any children and fail

This would have a lot more insecurity than the original cgroup idea, and I am unsure about the license of the existing ctypes based prctl bindings. We could create this utility in rust quite easily.

Upside is that this requires no priviledges as far as I know.

koivunej avatar May 06 '24 13:05 koivunej

With #8742 it would appear that for a single run, we would no longer leak processes, possibly affected by #8714.

koivunej avatar Aug 16 '24 06:08 koivunej

(spring cleaning the storage board)

@bayandin is this ticket useful? I'm happy to close it if this is something we're unlikely to work on in the near future.

jcsp avatar Apr 07 '25 10:04 jcsp

When running the test suite locally, and hitting Ctrl-C, or if something hangs because of a buggy test, it definitely still happens sometimes that we leak subprocesses.

I think the approach from my

  • https://github.com/neondatabase/neon/pull/6470 is still the best approach to truly ensure isolation.

I don't see myself working on it in the foreseeable future though. We have a devprod team nowadays, that knows Python/pytest much better than I do. Maybe they are interested in picking it up?

problame avatar Apr 07 '25 14:04 problame

Closing due to inactivity / lack of resources to do this.

Process leakage hasn't been a problem for CI stability recently, so, doesn't justify investment ATM. It does happen, esp broker and storcon db processes aren't getting cleaned up properly in many tests.

For posterity, I use systemd-run as an (insufficient) substitute to at least constrain leaked processes

christian@neon-hetzner-dev-christian:[~/src/neon-work-1]: systemd-run --user --shell --property=MemoryHigh=99G --property=MemoryMax=100G --property=KillSignal=SIGKILL

problame avatar Jun 30 '25 10:06 problame