
memory leak present even in trivial tests

Open pspacek opened this issue 1 year ago • 10 comments

Versions tested

  • hypothesis 6.100.4
  • Python 3.12.3
  • attrs 23.2.0

Summary

For some reason, even trivial stateless and stateful test cases leak memory at a rate of several megabytes per second.

Reproducers

Here are two test-nothing test cases:

  • test_given.py.gz - leaks on average ~ 2 MB/s, and just ~ 1 MB/s in the second half. On my machine the memory consumption goes from 25 MB to 145 MB by the end.

  • stateful_minimal.py.gz - leaks on average ~ 7 MB/s, still ~ 7 MB/s in the second half of the run. On my machine the memory consumption goes from 25 MB to 170 MB by the end.

Both files can be executed with python -m unittest, so pytest is out of the equation.
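
The attachments are not inlined above, but they boil down to something like the following sketch (class and method names are placeholders of mine, not the exact attached files):

# Sketch of the kind of "test-nothing" reproducers attached above (not the exact files).
import unittest

from hypothesis import given, settings, strategies as st
from hypothesis.stateful import RuleBasedStateMachine, rule


class TrivialGiven(unittest.TestCase):
    @settings(max_examples=5000)
    @given(st.integers())
    def test_nothing(self, value):
        pass  # no assertions: any memory growth comes from Hypothesis itself


class StatefulMinimal(RuleBasedStateMachine):
    @rule(value=st.integers())
    def do_nothing(self, value):
        pass  # the state machine keeps no state either


# Expose the state machine to unittest, as documented for RuleBasedStateMachine.
StatefulMinimalTest = StatefulMinimal.TestCase
StatefulMinimalTest.settings = settings(max_examples=5000)

if __name__ == "__main__":
    unittest.main()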

I've attempted memory profiling with the Scalene profiler (version 1.5.41), but I did not get anything conclusive. At first it pointed to random.getstate(), so I butchered internal/entropy.py:deterministic_PRNG() and control.py:deprecate_random_in_strategy() by commenting out all calls to random.getstate(), but the overall memory consumption did not move. Scalene just started to claim that the memory went to different places.

I don't trust Scalene's memory attribution very much, but there is certainly some sort of leak, because running the tests in a memory-constrained environment invokes the OOM killer.

The simplest way to verify that more memory is actually being consumed, and not just "virtually", is to compare settings(max_examples=100) with settings(max_examples=5000). In Bash, limiting memory with ulimit -d 50000 allows the small test cases to finish, while the larger ones explode.

I don't understand the subtleties of ulimit, but a similar effect can be achieved with systemd-run --user --shell -p MemoryMax=50M -p MemoryHigh=45M -p MemorySwapMax=0B if you are on a system with systemd. In that case the OOM killer does not allow the larger test cases to finish.
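
For completeness, the same kind of cap can be set from inside Python with the standard resource module (a sketch, Linux-only; ulimit -d maps to RLIMIT_DATA and counts kilobytes, while setrlimit takes bytes):

# Sketch: approximate "ulimit -d 50000" from within the test process (Linux only).
import resource

_soft, hard = resource.getrlimit(resource.RLIMIT_DATA)
resource.setrlimit(resource.RLIMIT_DATA, (50_000 * 1024, hard))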

pspacek avatar May 06 '24 12:05 pspacek

I think this is probably the conjecture data cache in internal/conjecture/engine.py. It is bounded but large, at CACHE_SIZE = 10000 examples.

There is a memory/performance trade-off here, and the benefits of this cache are mainly in the shrinking phase.

jobh avatar May 07 '24 06:05 jobh

FWIW: Changing to CACHE_SIZE = 1000 looks like it caps the stateful_minimal test at under 50 MB. But there are other large-ish caches, so to tell leaks apart from merely over-sized caches we probably need to run 5-10k examples just to reach steady state.
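
For anyone who wants to repeat that experiment without editing their installed copy, overriding the constant before the tests start should have the same effect - with the caveats that this is a Hypothesis internal which may be renamed or removed at any time, and that it assumes the engine reads the constant at run time:

# Experiment only: shrink the conjecture data cache before any test runs.
# CACHE_SIZE is internal to Hypothesis, so this may stop working in a future release.
from hypothesis.internal.conjecture import engine

engine.CACHE_SIZE = 1000  # default is 10000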

jobh avatar May 07 '24 08:05 jobh

I wanted to see when memory consumption flattens out, so I modified stateful_minimal.py.gz to use max_examples=5000000 and ran it with:

$ pytest --hypothesis-show-statistics --hypothesis-explain stateful_minimal.py

Occasionally this fails with:

E               hypothesis.errors.Flaky: Inconsistent results from replaying a test case!
E                 last: VALID from None
E                 this: INTERESTING from DeadlineExceeded at /usr/lib/python3.12/site-packages/hypothesis/core.py:846

Sometimes it does finish, after about 5 minutes:

stateful_minimal.py::stateful_minimal::runTest:

  - during reuse phase (0.00 seconds):
    - Typical runtimes: ~ 1ms, of which ~ 1ms in data generation
    - 1 passing examples, 0 failing examples, 0 invalid examples

  - during generate phase (313.17 seconds):
    - Typical runtimes: ~ 0-3 ms, of which ~ 0-3 ms in data generation
    - 69508 passing examples, 0 failing examples, 0 invalid examples

  - Stopped because nothing left to do

At this point it ate 440 MB of memory.

Here's memory consumption over time, sampled at 10-second intervals:

RssAnon:	  24320 kB
RssAnon:	  93056 kB
RssAnon:	 162176 kB
RssAnon:	 229632 kB
RssAnon:	 291840 kB
RssAnon:	 337616 kB
RssAnon:	 354128 kB
RssAnon:	 355280 kB
RssAnon:	 355280 kB
RssAnon:	 355360 kB
RssAnon:	 355360 kB
RssAnon:	 355360 kB
RssAnon:	 355360 kB
RssAnon:	 354792 kB
RssAnon:	 354792 kB
RssAnon:	 355816 kB
RssAnon:	 361192 kB
RssAnon:	 366056 kB
RssAnon:	 371176 kB
RssAnon:	 375656 kB
RssAnon:	 382952 kB
RssAnon:	 389480 kB
RssAnon:	 393064 kB
RssAnon:	 403048 kB
RssAnon:	 411624 kB
RssAnon:	 421224 kB
RssAnon:	 429544 kB
RssAnon:	 435688 kB
RssAnon:	 439912 kB
RssAnon:	 450792 kB

(Generated with a trivial while true; do grep RssAnon /proc/`pgrep -f unittest`/status; sleep 10; done loop.)
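
An equivalent in-process sampler, in case pgrep is inconvenient (a sketch, Linux-only, reading the same counter from /proc/self):

# Sketch: print this process's RssAnon every `interval` seconds (Linux only).
import threading

def log_rss_anon(interval=10.0):
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("RssAnon:"):
                print(line.rstrip())
                break
    timer = threading.Timer(interval, log_rss_anon, args=(interval,))
    timer.daemon = True  # don't keep the process alive after the tests finish
    timer.start()

log_rss_anon()  # call once before the tests start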

Is it possible that it caches all 69508 examples? If not, why is it still consuming more memory?

Please note I'm not saying this is some huge problem - at least until we get to hypofuzzing state machines (which is the original motivation for my experiment).

pspacek avatar May 07 '24 09:05 pspacek


Hm. I wonder if that DeadlineExceeded might be from a GC run. I also wonder whether the post-plateau growth would be influenced by forcing GC occasionally?

jobh avatar May 07 '24 09:05 jobh

So, I ran the test in memray to get better measurements, and in summary I don't see strong evidence of leaks. What I think we're seeing is that memory consumption has two main drivers:

  1. Short term, the caches. Mainly the conjecture data cache, at around 30kB per example. Bounded, mainly by CACHE_SIZE in engine.py.
  2. Long term, DataTree which tracks previously tried data to avoid repeats, at around 1kB per example. Unbounded.
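
As a rough cross-check (back-of-envelope only, using nothing but the per-example estimates above), those two drivers land in the right ballpark for the ~440 MB seen in the 70k-example run earlier:

# Back-of-envelope, using the per-example estimates above.
cache_kb = 10_000 * 30    # conjecture data cache, bounded by CACHE_SIZE
datatree_kb = 70_000 * 1  # DataTree, grows with every distinct example
print(cache_kb + datatree_kb)  # ~370,000 kB, i.e. roughly 370 MB on top of the interpreter baseline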

As evidence (running the stateful test above, about 70k examples):

Default run: [memory plot]

Run with CACHE_SIZE=1000: [memory plot]

Possible actions:

  • Check the sizes above against intuition; if they are much bigger than expected, it may be worth investigating further.
  • The DeadlineExceeded, which I guess is caused by GC, is an annoying flake. Should we perhaps force periodic gc.collect() outside of the tests, or add gc callbacks to account for the time taken?
  • Possibly tune down CACHE_SIZE or somehow be more selective in what we cache or retain.

@Zac-HD, @tybug, does this sound reasonable?

jobh avatar May 08 '24 13:05 jobh

I think either of you has a better intuition for memory size than I do. I would say it is very possible that memory usage has increased recently due to ir-related churn, and we should do a detailed pass once things settle. In particular, 1 kB/example for DataTree sounds a bit high to me. I wonder what these graphs look like pre-ir; say v6.98.0.

Liam-DeVoe avatar May 08 '24 18:05 Liam-DeVoe

That's larger than I was expecting, but makes sense once I actually think about the data structures and objects involved. I agree that it's plausible the IR migration is causing temporarily higher memory usage, both because we're carrying around two representations now, and because the IR hasn't really been memory-optimized yet. Not worth trying to fix that before we finish migrating though.

Adding gc.collect() before each (or each 1000th) invocation of the user code isn't a terrible idea, but I'm not sure it'd avoid occasional gc pauses anyway - for that you'd have to call gc.freeze(), and that's a semantics change I don't want.
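
For context on that last point: gc.freeze() moves everything currently tracked by the collector into a permanent generation that future collections ignore, which is why it is a semantics change. A sketch of the pattern being ruled out:

# Sketch of the gc.freeze() pattern being ruled out above (changes GC semantics).
import gc

gc.collect()   # collect whatever garbage already exists
gc.freeze()    # move all surviving objects to the permanent generation
# ...run the tests: frozen objects are no longer scanned or collected...
gc.unfreeze()  # restore normal behaviour afterwards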

Zac-HD avatar May 09 '24 08:05 Zac-HD

Good idea to compare with pre-ir! v6.98.0 below. At first I left it running over a late breakfast and it didn't finish... I guess the exhaustion tracking has improved! Anyway, with the number of examples set equal:

[memory plot: v6.98.0]

You'll notice immediately that we use a lot less memory in the post-ir world: the (probable) DataTree contribution is halved or more. Yay!

Less obviously, we produce more reference cycles, so the GC works harder. It's worth keeping an eye on this during development/migration, since data structures are harder than code to "fix" afterwards.

jobh avatar May 09 '24 10:05 jobh

[...] it's plausible the IR migration is causing temporarily higher memory usage, both because we're carrying around two representations now, and because the IR hasn't really been memory-optimized yet. Not worth trying to fix that before we finish migrating though.

Yep, agree on this (with both of you). We can revisit afterwards - for the record, if nothing else.

Adding gc.collect() before each (or each 1000th) invocation of the user code isn't a terrible idea, but I'm not sure it'd avoid occasional gc pauses anyway - for that you'd have to call gc.freeze(), and that's a semantics change I don't want.

Or alternatively, record and account for the pauses using gc.callbacks. Worth it? I don't know, but it is likely to hit any test with a high example count and the default deadline setting.
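
For reference, the hook itself is small; a minimal sketch of the kind of accounting this would involve (illustration only, not Hypothesis code):

# Sketch: accumulate time spent in GC pauses via gc.callbacks (illustration only).
import gc
import time

gc_seconds = 0.0
_gc_start = None

def _track_gc(phase, info):
    global gc_seconds, _gc_start
    if phase == "start":
        _gc_start = time.perf_counter()
    elif phase == "stop" and _gc_start is not None:
        gc_seconds += time.perf_counter() - _gc_start
        _gc_start = None

gc.callbacks.append(_track_gc)
# ...run the test; gc_seconds could then be subtracted from the measured
# runtime before it is compared against the deadline.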

jobh avatar May 09 '24 10:05 jobh

Or alternatively, record and account for the pauses using gc.callbacks. Worth it? I don't know, but it is likely to hit any test with a high example count and the default deadline setting.

Oh, that's a better idea - we can attribute GC time correctly in statistics and observability output too.

Zac-HD avatar May 09 '24 16:05 Zac-HD