
Crawlee increasing concurrency until it dies

Open ericvg97 opened this issue 1 month ago • 8 comments

I am deploying Crawlee in a Kubernetes pod. It gets recurrently OOMKilled because Crawlee continuously increases the desired concurrency. I don't want to decrease max_concurrency, because some of the domains I crawl are super lightweight while others aren't, and I'd like Crawlee to maximize throughput. I could also increase the pod's RAM, but I think there is an underlying issue that would come up later anyway (or I would just underuse my resources).

I am seeing this log, which makes me suspect Crawlee doesn't actually know the memory and CPU it is using: current_concurrency = 21; desired_concurrency = 21; cpu = 0.0; mem = 0.0; event_loop = 0.148; client_info = 0.0

CPU is not a big problem, because the kernel throttles CPU for this pod, but memory is a hard limit and Kubernetes kills the pod.

For more context, I am using the adaptive Playwright crawler with BeautifulSoup and headless Firefox. My concurrency settings (see the sketch after the resource block below) are: concurrency_settings = ConcurrencySettings(max_concurrency=45, desired_concurrency=10, min_concurrency=10)

and I am giving the pod

resources:
    limits:
      cpu: "12"
      memory: "12Gi"
    requests:
      cpu: "8"
      memory: "8Gi"

ericvg97 avatar Nov 06 '25 16:11 ericvg97

Hi @ericvg97, this is an interesting one 😁 Could you share some details about your Kubernetes setup? What's your container runtime? Is there some hosted Kubernetes service where this could be easily replicated?

~~As a workaround, I suppose you could set max_concurrency in your crawler. See the concurrency_settings parameter to BasicCrawler.__init__~~ - sorry, just noticed that you are aware of this option.

janbuchar avatar Nov 06 '25 16:11 janbuchar

I sent you a Discord private message, as I'd rather not share too many details here.

ericvg97 avatar Nov 07 '25 07:11 ericvg97

Hello, this could be connected to https://github.com/apify/crawlee-python/issues/1224.

> I am seeing this log, which makes me suspect Crawlee doesn't actually know the memory and CPU it is using: current_concurrency = 21; desired_concurrency = 21; cpu = 0.0; mem = 0.0; event_loop = 0.148; client_info = 0.0

This log output is not showing resource utilization, but an internal metric for resource "overutilization in recent snapshots". So your log could be interpreted as:

  • CPU usage below the overutilization threshold
  • Memory usage below the overutilization threshold
  • Event loop - overutilized in 14.8% of recent measurement snapshots
  • Client API errors below the overutilization threshold

The log is hard to understand, but that is directly linked to the way the AutoscaledPool controller works: https://github.com/apify/crawlee-python/issues/705
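
To make that concrete, here is a toy sketch of how such a ratio behaves (illustrative only, not the actual crawlee implementation):

```python
# Toy illustration only (not the actual crawlee internals): the logged value
# is the share of recent snapshots in which a resource was flagged as
# overloaded, not a utilization percentage.

def overload_ratio(overloaded_flags: list[bool]) -> float:
    """Fraction of recent snapshots flagged as overloaded, e.g. 0.148 = 14.8%."""
    if not overloaded_flags:
        return 0.0
    return sum(overloaded_flags) / len(overloaded_flags)

# mem = 0.0 therefore means "no recent snapshot exceeded the memory
# overutilization threshold", not "memory usage is 0%".
print(overload_ratio([False] * 23 + [True] * 4))  # -> ~0.148
```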

Pijukatel avatar Nov 07 '25 07:11 Pijukatel

Oh I see, I was assuming it was more like a percentage.

But either way, if my pod is being killed because it is using more than (or close to) 100% of the memory it is allowed to use, shouldn't Crawlee at some point realize it is using too much memory, or at least stop increasing the concurrency?

Also, with CPU, you can never overutilize it, because the kernel throttles you when you reach around 90%, so the metric will never be greater than 0 in these contexts...

ericvg97 avatar Nov 07 '25 08:11 ericvg97

I am not sure it is related, but I am not overriding the available_memory_ratio default value in the configuration, so shouldn't Crawlee target using 25% of the available memory?

ericvg97 avatar Nov 07 '25 08:11 ericvg97

> But either way, if my pod is being killed because it is using more than (or close to) 100% of the memory it is allowed to use, shouldn't Crawlee at some point realize it is using too much memory, or at least stop increasing the concurrency?

Yes, it should, and normally it does. But estimating used memory is tricky, and there could be some issues. We have an estimation in place and a failsafe on top of it that should prevent OOM, but maybe that is not working in some cases.

Pijukatel avatar Nov 07 '25 10:11 Pijukatel

> I am not sure it is related, but I am not overriding the available_memory_ratio default value in the configuration, so shouldn't Crawlee target using 25% of the available memory?

It should, unless you also specify Configuration.memory_mbytes: https://github.com/apify/crawlee-python/blob/master/src/crawlee/_autoscaling/snapshotter.py#L121
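
A paraphrased sketch of that logic (based on the linked snapshotter code, not a verbatim copy):

```python
# Paraphrased sketch (not the exact implementation): an explicit
# memory_mbytes takes precedence over available_memory_ratio.
import psutil

from crawlee.configuration import Configuration

config = Configuration.get_global_configuration()

if config.memory_mbytes:
    # An explicit cap in megabytes wins.
    max_memory_bytes = config.memory_mbytes * 1024 * 1024
else:
    # Default: available_memory_ratio (0.25) of total system memory. Inside a
    # container, psutil typically reports the node's memory rather than the
    # pod limit, so the computed budget can exceed what Kubernetes allows.
    max_memory_bytes = int(psutil.virtual_memory().total * config.available_memory_ratio)
```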

Pijukatel avatar Nov 07 '25 10:11 Pijukatel

I am not specifying it either 🤔

ericvg97 avatar Nov 07 '25 11:11 ericvg97