filprofiler
profiled process getting killed much too early on Mac by OOM detector
Version information
Fil: 2021.12.2
Python: 3.7.12 (default, Dec 20 2021, 11:33:29) [Clang 13.0.0 (clang-1300.0.29.3)]
Additional context that could be valuable: this is on macOS Monterey on an M1 Max, but I'm specifically running this as an x64 process, not ARM.
The machine has 64 GB of RAM.
This is what is getting output:

```
=fil-profile= WARNING: Excessive swapping. Program itself allocated 28992299350 bytes, 19128373248 are resident, the difference (presumably swap) is 9863926102, which is more than available system bytes 9830146048
=fil-profile= WARNING: Detected out-of-memory condition, exiting soon.
=fil-profile= Host memory info: Ok(VirtualMemory { total: 68719476736, available: 9830146048, used: 7917293568, free: 2665578496, percent: 85.69525, active: 6971088896, inactive: 5435318272, wired: 946204672 }) Ok(SwapMemory { total: 1073741824, used: 106168320, free: 967573504, percent: 9.887695, swapped_in: 25114910720, swapped_out: 488624128 })
=fil-profile= Process memory info: Ok(MemoryInfo { rss: 19128999936, vms: 68041039872, page_faults: 5594374, pageins: 0 })
=fil-profile= We'll try to dump out SVGs. Note that no HTML file will be written.
=fil-profile= Preparing to write to fil-result/2022-01-21T15:51:40.273
=fil-profile= Wrote flamegraph to "fil-result/2022-01-21T15:51:40.273/out-of-memory.svg"
=fil-profile= Wrote flamegraph to "fil-result/2022-01-21T15:51:40.273/out-of-memory-reversed.svg"
```
I can reproduce this consistently.
However, the process runs to completion when not run inside filprofiler, and it also seems to work just fine with --disable-oom-detection. (The runs take hours, so all I can be certain of so far is that the flag prevents the process from getting killed early on. I'll update/close this later if I hit an actual OOM, but this exact same run has completed successfully before, and I've already passed ~5x the RAM that was in use when filprofiler killed it, so I doubt it will.)
It's nice that the flag exists, but the behavior feels like a bug to me. My machine handles the RAM usage just fine when I run the process on its own, so it seems like filprofiler's calibration for OOM is way off.
Thanks for the bug report! I will take a look at the heuristics again.
Turning the above into a more readable form:
| Host | value |
|---|---|
| total | 68,719,476,736 |
| available | 9,830,146,048 |
| used | 7,917,293,568 |
| free | 2,665,578,496 |
| percent | 85% |
| active | 6,971,088,896 |
| inactive | 5,435,318,272 |
| wired | 946,204,672 |

| Process | value |
|---|---|
| rss | 19,128,999,936 |
| vms | 68,041,039,872 |
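For context, the warning message implies the heuristic works roughly as follows. The real check lives in Fil's Rust code; the Python below is an illustrative reconstruction, and the function name is made up:

```python
# Reconstruction of the swap-based OOM heuristic implied by the warning:
# the portion of the program's own allocations that is not resident is
# presumed to be swapped out, and if that exceeds the memory the OS
# reports as available, Fil assumes an OOM condition is imminent.

def excessive_swapping(allocated: int, resident: int, available: int) -> bool:
    """True when presumed swap (allocated - resident) exceeds available RAM."""
    presumed_swap = allocated - resident
    return presumed_swap > available

# Plugging in the numbers from the log above:
print(excessive_swapping(28_992_299_350, 19_128_373_248, 9_830_146_048))
# True: 9,863,926,102 presumed-swapped bytes > 9,830,146,048 available bytes
```
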
- Theory: bad heuristic. Heuristic was added for macOS specifically, where swapping was very aggressive so the heuristic is "you have a lot of swap". But I wonder if on M1 hardware swapping is even more aggressive? Anecdotally I would expect that.
- Theory: more RAM invalidates heuristic. Or, maybe it's just that you have a lot more RAM than the macOS machines I've tested on.
- Theory: bad data. The numbers reported for the host are weird. Where did the rest of the memory go? So possibly the library used to get memory info is giving bad information.
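The "bad data" theory is easy to quantify: on a healthy stat source, free + active + inactive + wired should come reasonably close to total RAM. Summing the host numbers from the report (my guess is the gap is macOS compressed memory and file cache that this breakdown doesn't account for, but that's speculation):

```python
# Sanity check on the reported host memory categories.
total    = 68_719_476_736
free     = 2_665_578_496
active   = 6_971_088_896
inactive = 5_435_318_272
wired    = 946_204_672

accounted = free + active + inactive + wired
unaccounted = total - accounted
print(f"accounted:   {accounted:,}")    # 16,018,190,336
print(f"unaccounted: {unaccounted:,}")  # 52,701,286,400 -- ~52 GB missing
```

Note also that the process rss (19,128,999,936) is bigger than the host's reported "used" (7,917,293,568), which shouldn't normally happen.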
One interesting thing I've observed since reporting this is that I can get similar behavior on Ubuntu 18.04. Sometimes fil-profiler will cause the process to exit within the first few minutes, but the process can complete successfully when --disable-oom-detection is provided.
On both of these OSes, I have some form of RAM compression enabled. I think that's the default for macOS, but on Ubuntu I've enabled zram. I don't know if that's relevant to any of your hypotheses.
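For reference, on the Ubuntu machine zram's behavior can be inspected via `/sys/block/zram0/mm_stat`, whose first two fields are original and compressed data size in bytes (the file only exists when zram is active, so the sample line below uses made-up values):

```python
# Compute the zram compression factor from a /sys/block/zram0/mm_stat line.

def zram_ratio(mm_stat_line: str) -> float:
    """First two fields are orig_data_size and compr_data_size (bytes)."""
    orig, compr = map(int, mm_stat_line.split()[:2])
    return orig / compr

sample = "1073741824 268435456 301989888 0 301989888 0 0 0"  # made-up values
print(zram_ratio(sample))  # 4.0 for these made-up numbers
```
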
Fascinating. I guess that's a fourth theory: compression might distort the RAM availability statistics.
Still thinking about what to do... Some options:
- Status quo. Always an option. If small enough number of people have problems with the current OOM heuristic, maybe that's fine.
- OOM detection off by default. Most people won't run out of memory, probably, and if they do they can rerun; the goal is offline profiling anyway, so there's no assumption that this is running production code. If I do this, I should probably print a message at start saying "if the run dies half-way, re-run with this flag enabled".
- Tweak heuristics. Maybe the heuristics could be more flexible/accurate? Linux exposes memory-pressure data, on a sufficiently new kernel with a suitable config. But macOS doesn't.
- Special-case compressed RAM. Not sure it's the actual issue, or can be detected reliably even if it is, though.
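On the "tweak heuristics" option: the Linux memory-pressure data mentioned above is PSI, exposed as `/proc/pressure/memory` on kernels ≥ 4.20 built with CONFIG_PSI. A sketch of parsing it (the sample text and the 0.5 threshold are illustrative, not tested values):

```python
# Parse Linux PSI memory-pressure data, one possible input for a better
# OOM heuristic on Linux. /proc/pressure/memory has two lines, "some"
# and "full", each with avg10/avg60/avg300 percentages and a total.

def parse_psi(text: str) -> dict:
    """Parse /proc/pressure/memory content into {'some': {...}, 'full': {...}}."""
    result = {}
    for line in text.splitlines():
        kind, _, rest = line.partition(" ")
        result[kind] = {k: float(v) for k, v in
                        (field.split("=") for field in rest.split())}
    return result

sample = ("some avg10=12.50 avg60=8.00 avg300=2.00 total=123456\n"
          "full avg10=3.00 avg60=1.00 avg300=0.50 total=45678")

psi = parse_psi(sample)
# e.g. treat sustained full-stall pressure as the warning sign:
print(psi["full"]["avg60"] > 0.5)  # True for this sample
```
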