memory leak present even in trivial tests
Versions tested
- hypothesis 6.100.4
- Python 3.12.3
- attrs 23.2.0
Summary
For some reason, even trivial stateless and stateful test cases leak memory at a rate of several megabytes per second.
Reproducers
Here are two test cases that test nothing:
- test_given.py.gz - leaks on average ~2 MB/s (about 1 MB/s in the second half of the run). On my machine memory consumption goes from 25 MB to 145 MB by the end.
- stateful_minimal.py.gz - leaks on average ~7 MB/s (still 7 MB/s in the second half of the run). On my machine memory consumption goes from 25 MB to 170 MB by the end.
Both files can be executed with `python -m unittest`, so pytest is out of the equation.
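For reference, hypothetical stand-ins for the attached reproducers look roughly like this (not the exact attached code; the class names and max_examples value are illustrative):

```python
# Hypothetical stand-ins for test_given.py / stateful_minimal.py -- do-nothing
# tests that only exercise data generation.
import unittest

from hypothesis import given, settings, strategies as st
from hypothesis.stateful import RuleBasedStateMachine, rule


class TestGiven(unittest.TestCase):
    @settings(max_examples=5000)
    @given(st.integers())
    def test_nothing(self, x):
        pass  # deliberately empty


class StatefulMinimal(RuleBasedStateMachine):
    @rule(x=st.integers())
    def do_nothing(self, x):
        pass  # deliberately empty


# RuleBasedStateMachine.TestCase is a unittest.TestCase subclass, so both
# classes run under `python -m unittest`.
StatefulMinimalTest = StatefulMinimal.TestCase
StatefulMinimalTest.settings = settings(max_examples=5000)
```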
I attempted memory profiling with the Scalene profiler (version 1.5.41), but did not get anything conclusive. At first it pointed at `random.getstate()`, so I butchered `internal/entropy.py:deterministic_PRNG()` and `control.py:deprecate_random_in_strategy()` by commenting out all calls to `random.getstate()`, but overall memory consumption did not move; Scalene just started to claim that the memory went to different places.
I don't trust Scalene's memory attribution very much, but there certainly is some sort of leak, because running the tests in a memory-constrained environment invokes the OOM killer.
The simplest way to verify that more memory is really being consumed, and not just "virtually", is to compare `settings(max_examples=100)` with `settings(max_examples=5000)`. In bash, limiting memory with `ulimit -d 50000` allows the small test cases to finish, but the larger ones explode.
I don't understand the subtleties of `ulimit`, but a similar effect can be achieved with `systemd-run --user --shell -p MemoryMax=50M -p MemoryHigh=45M -p MemorySwapMax=0B` if you are on a system with systemd. In that case the OOM killer does not allow the larger test cases to finish.
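(If you want to stay inside Python rather than the shell, the stdlib `resource` module can impose a rough in-process equivalent of `ulimit -d`; a sketch, Linux-only and with illustrative numbers:)

```python
# Rough in-process equivalent of `ulimit -d 50000` (bash's -d is in kB,
# setrlimit takes bytes). Linux-only; RLIMIT_DATA behaviour varies by platform.
import resource

limit = 50_000 * 1024
resource.setrlimit(resource.RLIMIT_DATA, (limit, limit))
```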
I think this is probably the conjecture data cache in `internal/conjecture/engine.py`. It is bounded, but large, at `CACHE_SIZE = 10000` examples.
There is a memory/performance trade-off here, and the benefits of this cache are mainly in the shrinking phase.
FWIW: changing to `CACHE_SIZE = 1000` looks like it caps the stateful_minimal test at under 50 MB. But there are other large-ish caches, so to identify genuine leaks, as opposed to plus-size caches, we probably need to run 5-10k examples just to reach a steady state.
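(For anyone repeating that experiment without editing the source: `CACHE_SIZE` is a module-level constant in an internal module, so something like the following should work, with the usual caveat that Hypothesis internals are not a stable API and may change without notice.)

```python
# Experimental only: shrink the conjecture data cache before the tests run.
# This pokes at Hypothesis internals, which are not a stable API.
import hypothesis.internal.conjecture.engine as engine

engine.CACHE_SIZE = 1000  # default is 10000
```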
I wanted to see when memory consumption flattens out, so I modified stateful_minimal.py.gz to use max_examples=5000000 and ran it with:
`$ pytest --hypothesis-show-statistics --hypothesis-explain stateful_minimal.py`
Occasionally this fails with:
E hypothesis.errors.Flaky: Inconsistent results from replaying a test case!
E last: VALID from None
E this: INTERESTING from DeadlineExceeded at /usr/lib/python3.12/site-packages/hypothesis/core.py:846
Sometimes it does finish, after about 5 minutes:
stateful_minimal.py::stateful_minimal::runTest:
- during reuse phase (0.00 seconds):
- Typical runtimes: ~ 1ms, of which ~ 1ms in data generation
- 1 passing examples, 0 failing examples, 0 invalid examples
- during generate phase (313.17 seconds):
- Typical runtimes: ~ 0-3 ms, of which ~ 0-3 ms in data generation
- 69508 passing examples, 0 failing examples, 0 invalid examples
- Stopped because nothing left to do
At this point it ate 440 MB of memory.
Here's memory consumption over time, with a 10-second sampling interval:
RssAnon: 24320 kB
RssAnon: 93056 kB
RssAnon: 162176 kB
RssAnon: 229632 kB
RssAnon: 291840 kB
RssAnon: 337616 kB
RssAnon: 354128 kB
RssAnon: 355280 kB
RssAnon: 355280 kB
RssAnon: 355360 kB
RssAnon: 355360 kB
RssAnon: 355360 kB
RssAnon: 355360 kB
RssAnon: 354792 kB
RssAnon: 354792 kB
RssAnon: 355816 kB
RssAnon: 361192 kB
RssAnon: 366056 kB
RssAnon: 371176 kB
RssAnon: 375656 kB
RssAnon: 382952 kB
RssAnon: 389480 kB
RssAnon: 393064 kB
RssAnon: 403048 kB
RssAnon: 411624 kB
RssAnon: 421224 kB
RssAnon: 429544 kB
RssAnon: 435688 kB
RssAnon: 439912 kB
RssAnon: 450792 kB
(Generated with a trivial `while true; do grep RssAnon /proc/$(pgrep -f unittest)/status; sleep 10; done` loop.)
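For reference, an equivalent sampler in Python, reading the same /proc field (Linux-only; pass the target pid, e.g. whatever `pgrep -f unittest` reports):

```python
# Sample RssAnon of a target process every 10 seconds (Linux-only).
# Usage: python sample_rss.py <pid>
import sys
import time


def rss_anon_kb(pid: int) -> int:
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("RssAnon:"):
                return int(line.split()[1])  # value is reported in kB
    raise RuntimeError("RssAnon not found")


if __name__ == "__main__":
    pid = int(sys.argv[1])
    while True:
        print(f"RssAnon: {rss_anon_kb(pid)} kB", flush=True)
        time.sleep(10)
```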
Is it possible that it caches all 69508 examples? Otherwise, why would it still be consuming more memory?
Please note I'm not saying this is a huge problem - at least not until we get to hypofuzzing state machines (which was the original motivation for my experiment).
Hm. I wonder if that DeadlineExceeded might be from a GC run. Also wonder if that post-plateau growth would be influenced by forcing GC occasionally?
So, I ran the test in memray to get better measurements, and in summary I don't see strong evidence of leaks. What I think we're seeing is that memory consumption has two main drivers:
- Short term, the caches. Mainly the conjecture data cache, at around 30 kB per example. Bounded, mainly by `CACHE_SIZE` in `engine.py`.
- Long term, `DataTree`, which tracks previously tried data to avoid repeats, at around 1 kB per example. Unbounded.
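Back-of-envelope, those per-example sizes are roughly consistent with the ~440 MB peak from the 70k-example run above, with the remainder being interpreter baseline and other allocations; a quick check (estimates only):

```python
# Back-of-envelope check using the rough per-example sizes quoted above.
cache = 10_000 * 30 * 1024    # full conjecture data cache: ~307 MB
datatree = 70_000 * 1 * 1024  # DataTree after ~70k examples: ~72 MB
print(f"{cache / 1e6:.0f} MB + {datatree / 1e6:.0f} MB ~= {(cache + datatree) / 1e6:.0f} MB")
```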
As evidence (running the stateful test above, about 70k examples):
- Default run
- Run with `CACHE_SIZE=1000`
Possible actions:
- Check the sizes above against intuition; if they are much bigger than expected then it may be worth investigating further.
- The `DeadlineExceeded`, which I guess is caused by GC, is an annoying flake. Should we perhaps force a periodic `gc.collect()` outside of the tests, or add gc callbacks to account for the time taken?
- Possibly tune down `CACHE_SIZE`, or somehow be more selective in what we cache or retain.
@Zac-HD, @tybug, does this sound reasonable?
I think either of you has better intuition for memory sizes than I do. I would say it is very possible that memory usage has increased recently due to ir-related churn, and we should do a detailed pass once things settle. In particular, 1 kB/example for `DataTree` sounds a bit high to me. I wonder what these graphs look like pre-ir; say v6.98.0.
That's larger than I was expecting, but makes sense once I actually think about the data structures and objects involved. I agree that it's plausible the IR migration is causing temporarily higher memory usage, both because we're carrying around two representations now, and because the IR hasn't really been memory-optimized yet. Not worth trying to fix that before we finish migrating though.
Adding gc.collect() before each (or each 1000th) invocation of the user code isn't a terrible idea, but I'm not sure it'd avoid occasional gc pauses anyway - for that you'd have to call gc.freeze(), and that's a semantics change I don't want.
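For concreteness, the two mechanisms being weighed look roughly like this (illustrative only; neither is an existing Hypothesis hook or setting):

```python
# Illustrative sketch only -- not an existing Hypothesis hook or setting.
import gc


def before_example(call_count: int, every: int = 1000) -> None:
    # Option 1: collect explicitly every Nth example, so GC pauses land
    # between examples rather than inside a single timed one.
    if call_count % every == 0:
        gc.collect()


# Option 2: gc.freeze() moves all currently tracked objects into a permanent
# generation that later collections ignore (until gc.unfreeze()), which is
# the semantics change mentioned above.
```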
Good idea to compare with pre-ir! v6.98.0 below; at first I left it running over a late breakfast and it didn't finish... I guess the exhaustion tracking has improved! Anyway, with the number of examples set equal:
You'll notice immediately that we use a lot less memory in the post-ir world: the (probable) DataTree contribution is halved or more. Yay!
Less obviously, we produce more reference cycles, so the GC works harder. It's worth keeping an eye on this during development/migration, since data structures are harder than code to "fix" afterwards.
> [...] it's plausible the IR migration is causing temporarily higher memory usage, both because we're carrying around two representations now, and because the IR hasn't really been memory-optimized yet. Not worth trying to fix that before we finish migrating though.
Yep, agree on this (with both of you). We can revisit afterwards - for the record, if nothing else.
> Adding `gc.collect()` before each (or each 1000th) invocation of the user code isn't a terrible idea, but I'm not sure it'd avoid occasional gc pauses anyway - for that you'd have to call `gc.freeze()`, and that's a semantics change I don't want.
Or alternatively, record and account for the pauses using `gc.callbacks`. Worth it? I don't know, but it is likely to hit any test with a high example count and the default deadline setting.
> Or alternatively, record and account for the pauses using `gc.callbacks`. Worth it? I don't know, but it is likely to hit any test with a high example count and the default deadline setting.
Oh, that's a better idea, we can attribute gc time correctly in statistics and observability output too.
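For reference, a minimal sketch of that kind of accounting with the stdlib `gc.callbacks` API (the accumulator and names here are illustrative, not Hypothesis internals):

```python
# Minimal sketch: accumulate time spent in GC pauses via gc.callbacks.
# `gc_time` is an illustrative accumulator, not a Hypothesis internal.
import gc
import time

gc_time = 0.0
_gc_start = None


def _track_gc(phase, info):
    global gc_time, _gc_start
    if phase == "start":
        _gc_start = time.perf_counter()
    elif phase == "stop" and _gc_start is not None:
        gc_time += time.perf_counter() - _gc_start
        _gc_start = None


gc.callbacks.append(_track_gc)
```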