fix: Use `USS` instead of `RSS` to estimate children process memory usage
Description
Use Unique Set Size (USS) instead of Resident Set Size (RSS) to estimate the memory usage of child processes. RSS counts shared memory once per process, so summing it across processes overestimates the total used memory; USS avoids that double counting.
Add a test.
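Roughly, the change amounts to something like the following sketch using `psutil` (illustrative only, not the exact implementation in this PR):

```python
import psutil


def estimate_memory_rss() -> int:
    """Old approach: sum RSS over the current process and its children.

    Shared pages are counted once per process, so the total is overestimated.
    """
    current = psutil.Process()
    total = current.memory_info().rss
    for child in current.children(recursive=True):
        try:
            total += child.memory_info().rss
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            pass  # The child may have exited or be inaccessible.
    return total


def estimate_memory_uss() -> int:
    """New approach: sum USS instead.

    Shared pages are not counted at all, so the total may now be underestimated.
    """
    current = psutil.Process()
    total = current.memory_full_info().uss
    for child in current.children(recursive=True):
        try:
            total += child.memory_full_info().uss
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            pass
    return total
```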
Issues
- Closes: #1206
Based on the docs, using `uss` might actually underestimate memory usage, as it does not count shared memory at all. It seems like `pss` might be the best approximation of used memory in our case, but it is available only on Linux. Maybe we can iteratively improve at least the Linux-based estimation for now and leave improving the estimation on Windows and other OSes for later.
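If we go the PSS route later, the platform-dependent fallback could look roughly like this (a hedged sketch, not a committed design; in `psutil`, `pss` is exposed only on Linux, and `memory_full_info()` may require elevated privileges):

```python
import psutil


def estimate_process_memory(process: psutil.Process) -> int:
    """Prefer PSS, fall back to USS, then to RSS."""
    try:
        full_info = process.memory_full_info()
    except psutil.AccessDenied:
        # memory_full_info() can need extra privileges; fall back to plain RSS.
        return process.memory_info().rss

    # PSS = private memory + each shared page divided by the number of
    # processes sharing it; the field exists only on Linux.
    pss = getattr(full_info, 'pss', None)
    if pss is not None:
        return pss

    # USS is available on Linux, macOS and Windows; otherwise fall back to RSS.
    return getattr(full_info, 'uss', full_info.rss)
```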
Coming up with the test was really hard. The test is not nice at all, but testing the memory usage estimation is really tricky because Python is too high-level for precise memory control.
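For reference, the general shape of such a test (names and numbers here are illustrative, not the actual test added in this PR): spawn a child process that pins a known amount of memory and assert that the estimate sees at least a good chunk of it.

```python
import multiprocessing
from multiprocessing.synchronize import Event

import psutil

ALLOC_BYTES = 100 * 1024 * 1024  # 100 MiB, large enough to stand out from the noise.


def _allocate_and_wait(ready: Event, done: Event) -> None:
    blob = bytearray(ALLOC_BYTES)
    # Touch every page so the memory is actually committed, not just reserved.
    for i in range(0, ALLOC_BYTES, 4096):
        blob[i] = 1
    ready.set()
    done.wait()


def test_child_process_memory_is_counted() -> None:
    ready, done = multiprocessing.Event(), multiprocessing.Event()
    child = multiprocessing.Process(target=_allocate_and_wait, args=(ready, done))
    child.start()
    try:
        assert ready.wait(timeout=30)
        child_uss = psutil.Process(child.pid).memory_full_info().uss
        # The measurement is noisy, so only assert a loose lower bound.
        assert child_uss > ALLOC_BYTES // 2
    finally:
        done.set()
        child.join()
```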
Couple of questions :slightly_smiling_face:
- how does this compare to what we do in the JS version? It looks like it also uses RSS (https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/memory-info.ts#L85-L85) - I'd like to know why it doesn't cause problems there (or does it?)
- when running locally, Crawlee thinks "memory is overloaded" means "stuff that we own takes up more memory than some configured part of the total system memory" - if our calculation of the used memory fails to take something into account, we could easily smother the system
- are you aware of anything "missed" by USS in a Crawlee workload?
- do we have some safety mechanism that would say the system is overloaded if the system memory has something like >95% utilization, regardless of `CRAWLEE_AVAILABLE_MEMORY_RATIO`? shouldn't we?
> how does this compare to what we do in the JS version? It looks like it also uses RSS (https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/memory-info.ts#L85-L85) - I'd like to know why it doesn't cause problems there (or does it?)
So far I can only guess. I have to run some experiments with the JS version to get some data. But if it relies on RSS only, then I think it could also overestimate used memory.
> do we have some safety mechanism that would say the system is overloaded if the system memory has something like >95% utilization, regardless of `CRAWLEE_AVAILABLE_MEMORY_RATIO`? shouldn't we?
Yes, I think it could be a good safety measure to bound it like that, regardless of this change.
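Something like the following could serve as that hard bound (a hypothetical sketch; the threshold and function names are made up, not an existing Crawlee setting):

```python
import psutil

SYSTEM_WIDE_LIMIT_PERCENT = 95  # Illustrative threshold.


def is_memory_overloaded(own_usage_bytes: int, max_used_memory_bytes: int) -> bool:
    # Hard safety bound: if the whole system is almost out of memory,
    # report overload no matter what CRAWLEE_AVAILABLE_MEMORY_RATIO says.
    if psutil.virtual_memory().percent >= SYSTEM_WIDE_LIMIT_PERCENT:
        return True
    # Regular bound derived from the configured ratio.
    return own_usage_bytes > max_used_memory_bytes
```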
> are you aware of anything "missed" by USS in a Crawlee workload?
In Crawlee itself probably not, but my guess is that multiple Playwright processes could actually share some memory, which would be overestimated by RSS and probably underestimated by USS. So PSS seems like the best fit in our case, as it accounts for shared memory in a somewhat predictable way.
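A made-up numeric example of why PSS sits between the two: imagine four browser processes, each with some private pages plus a block of pages they all share.

```python
private_mib = 50    # Pages unique to one browser process.
shared_mib = 120    # Pages mapped by all four processes (e.g. shared libraries).
n_sharers = 4

uss = private_mib                           # 50 MiB: shared pages ignored entirely.
rss = private_mib + shared_mib              # 170 MiB: shared pages counted in full for each process.
pss = private_mib + shared_mib / n_sharers  # 80 MiB: shared pages split among the sharers.
print(uss, rss, pss)
```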
> So far I can only guess. I have to run some experiments with the JS version to get some data. But if it relies on RSS only, then I think it could also overestimate used memory.
Keep in mind that on the platform, memory usage (and pretty much all the scaling metrics) comes over websockets; we don't measure it ourselves. So it's very much possible we don't do it perfectly and nobody noticed, since on localhost we use 1/4 of the available memory by default. Also, given that memory scales with CPU, you usually run things with enough memory.