Crawlee increasing concurrency until it dies
I am deploying crawlee in a Kubernetes pod. It gets recurrently OOMKilled because crawlee keeps increasing the desired concurrency. I don't want to decrease max_concurrency because some of the domains I crawl are super lightweight while others aren't, and I'd like crawlee to maximize throughput. I could also increase the pod's RAM, but I think there is an underlying issue that would come up again later (or I would just underuse my resources).
I am seeing this log, which makes me suspect crawlee doesn't actually know the memory and CPU it is using:
current_concurrency = 21; desired_concurrency = 21; cpu = 0.0; mem = 0.0; event_loop = 0.148; client_info = 0.0
CPU is not a big problem because the kernel throttles the CPU for this pod, but memory is a hard limit and Kubernetes kills the pod.
For more context, I am using the adaptive Playwright crawler with BeautifulSoup and headless Firefox, and my concurrency settings are:
concurrency_settings = ConcurrencySettings(max_concurrency=45, desired_concurrency=10, min_concurrency=10)
and I am giving the pod
resources:
  limits:
    cpu: "12"
    memory: "12Gi"
  requests:
    cpu: "8"
    memory: "8Gi"
Hi @ericvg97, this is an interesting one 😁 Could you share some details about your Kubernetes setup? What's your container runtime? Is there some hosted Kubernetes service where this could be easily replicated?
~As a workaround, I suppose you could set max_concurrency in your crawler. See the concurrency_settings parameter to BasicCrawler.__init__~ - sorry, just noticed that you are aware of this option
I sent you a discord private message as I'd rather not share too many details here
Hello, this could be connected to https://github.com/apify/crawlee-python/issues/1224
I am seeing this log, which makes me suspect crawlee doesn't actually know the memory and CPU it is using: current_concurrency = 21; desired_concurrency = 21; cpu = 0.0; mem = 0.0; event_loop = 0.148; client_info = 0.0
This log output is not showing the resource utilization, but some internal metric for resource "overutilization in recent snapshots". So your log could be interpreted as:
- CPU usage below the overutilization threshold
- Memory usage below the overutilization threshold
- Event loop - overutilized in 14.8% of recent measurement snapshots
- Client API errors below the overutilization threshold
The log is hard to understand, but that is directly linked to the way the AutoscaledPool controller works: https://github.com/apify/crawlee-python/issues/705
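To make the meaning concrete, here is an illustration (not the actual Snapshotter code) of how such a value is derived from recent snapshots:

```python
# Illustration only, not the actual crawlee implementation: the logged value
# is the share of recent snapshots flagged as overloaded, not a utilization %.
recent_event_loop_snapshots = [
    {'overloaded': False},
    {'overloaded': True},
    {'overloaded': False},
    # ... one entry per recent measurement
]

overloaded_ratio = sum(
    1 for snapshot in recent_event_loop_snapshots if snapshot['overloaded']
) / len(recent_event_loop_snapshots)

# A value of 0.148 therefore reads as "overloaded in ~14.8% of recent
# snapshots"; mem = 0.0 means "never flagged as overloaded recently",
# not that memory usage is literally zero.
```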
Oh I see, I was assuming it was more like a percentage.
But, either way, if my pod is being killed because it is using more than (or close to) 100% of the memory it is allowed to use, shouldn't crawlee at some point realize it is using too much memory or at least stop increasing the concurrency?
Also, with CPU, you can never overutilize it because the kernel throttles you when you reach around 90%, so the metric will never be bigger than 0 in these contexts...
I am not sure it is related, but I am not overriding the available_memory_ratio default value in the configuration, so shouldn't crawlee target using 25% of the available memory?
But, either way, if my pod is being killed because it is using more than (or close to) 100% of the memory it is allowed to use, shouldn't crawlee at some point realize it is using too much memory or at least stop increasing the concurrency?
Yes, it should, and normally it does. But the used memory estimation is kind of tricky, and there could be some issues. We have some estimation in place and a failsafe on top of it that should prevent OOM, but maybe that is not working in some cases.
I am not sure it is related, but I am not overriding the available_memory_ratio default value in the configuration, so shouldn't crawlee target using 25% of the available memory?
It should, unless you specify Configuration.memory_mbytes as well
https://github.com/apify/crawlee-python/blob/master/src/crawlee/_autoscaling/snapshotter.py#L121
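Roughly, the logic behind that line is (illustrative sketch, not the exact crawlee code; names are made up for clarity):

```python
# Illustrative sketch of the memory-budget resolution, not the exact crawlee code.
def resolve_max_memory_bytes(
    memory_mbytes: int | None,
    available_memory_ratio: float,
    total_system_memory_bytes: int,
) -> int:
    """Return the memory budget the autoscaler measures usage against."""
    if memory_mbytes is not None:
        # An explicit Configuration.memory_mbytes takes precedence.
        return memory_mbytes * 1024**2
    # Otherwise the budget is a fraction of the total memory crawlee detects.
    return int(total_system_memory_bytes * available_memory_ratio)
```

So with neither value overridden, the budget should indeed be 25% of whatever total memory crawlee detects inside the pod.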
I am not specifying it either 🤔