
fix: Use `USS` instead of `RSS` to estimate child process memory usage

Open · Pijukatel opened this issue 7 months ago · 1 comment

Description

Use Unique Set Size (USS) to estimate child process memory usage, to avoid overestimating used memory when shared memory is counted multiple times, as it is with Resident Set Size (RSS).

Add test.

Issues

  • Closes: #1206

Pijukatel avatar May 21 '25 14:05 Pijukatel

Based on the docs:

Using uss might actually underestimate memory usage, as it would not count shared memory at all. It seems pss might be the best approximation of used memory in our case, but it is available only on Linux. Maybe we can iteratively improve at least the Linux-based estimation for now and leave improving the estimation on Windows and other OSes for later.

Pijukatel avatar May 23 '25 14:05 Pijukatel

Coming up with the test was really hard. The test is not nice at all, but testing the memory usage estimation is really tricky due to Python being too high-level for precise memory control.

Pijukatel avatar May 30 '25 20:05 Pijukatel

Couple of questions :slightly_smiling_face:

  1. how does this compare to what we do in the JS version? It looks like it also uses RSS (https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/memory-info.ts#L85-L85) - I'd like to know why it doesn't cause problems there (or does it?)
  2. when running locally, "memory is overloaded" in Crawlee means "the stuff we own takes up more memory than some configured fraction of the total system memory" - if our calculation of used memory fails to take something into account, we could easily smother the system
    • are you aware of anything "missed" by USS in a Crawlee workload?
    • do we have some safety mechanism that would say the system is overloaded if the system memory has something like >95% utilization, regardless of CRAWLEE_AVAILABLE_MEMORY_RATIO? shouldn't we?

janbuchar avatar Jun 05 '25 11:06 janbuchar

  1. how does this compare to what we do in the JS version? It looks like it also uses RSS (https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/memory-info.ts#L85-L85) - I'd like to know why it doesn't cause problems there (or does it?)

So far I can only guess; I'll have to run some experiments with the JS version to get some data. But if it relies only on RSS, then I think it could also overestimate used memory.

do we have some safety mechanism that would say the system is overloaded if the system memory has something like >95% utilization, regardless of CRAWLEE_AVAILABLE_MEMORY_RATIO? shouldn't we?

Yes, I think it would be a good safety measure to bound it like that, regardless of this change.
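Such a bound could look roughly like the sketch below. The names and the 95% threshold are illustrative assumptions, not existing Crawlee API:

```python
def is_system_overloaded(system_used_percent: float,
                         hard_limit_percent: float = 95.0) -> bool:
    """Hard safety bound: report overload whenever total system memory
    utilization crosses the limit, regardless of what
    CRAWLEE_AVAILABLE_MEMORY_RATIO is configured to."""
    return system_used_percent >= hard_limit_percent


def system_memory_guard() -> bool:
    import psutil  # third-party

    # virtual_memory().percent is system-wide memory utilization in percent.
    return is_system_overloaded(psutil.virtual_memory().percent)
```

The guard would be checked in addition to (not instead of) the ratio-based calculation, so a miscalculated per-process estimate could not smother the host.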

are you aware of anything "missed" by USS in a Crawlee workload?

In Crawlee alone, probably not, but my guess is that multiple Playwright processes could actually share some memory, which would be overestimated by RSS and probably underestimated by USS. So PSS seems like the best option in our case, as it accounts for shared memory in a somewhat predictable way.

Pijukatel avatar Jun 05 '25 12:06 Pijukatel

So far I can only guess; I'll have to run some experiments with the JS version to get some data. But if it relies only on RSS, then I think it could also overestimate used memory.

Keep in mind that on the platform, memory usage (and pretty much all the scaling metrics) comes over websockets; we don't measure it ourselves. So it's very much possible we don't do it perfectly and nobody noticed, since on localhost we use 1/4 of the available memory by default. Also, given that memory scales with CPU, you usually run things with enough memory.

B4nan avatar Jun 05 '25 13:06 B4nan