
Fix autoscaled pool scaling behavior on 429 Too Many Requests

Open vdusek opened this issue 3 months ago • 3 comments

Description

  • Crawlee does not currently handle 429 Too Many Requests responses correctly.
  • When a target server starts returning 429s, Crawlee does not slow down.
  • Instead, due to the current autoscaled pool logic, Crawlee may actually scale concurrency up when responses get slower (because of less CPU work).
  • This creates a "death spiral" - the slower the server, the faster Crawlee increases concurrency, which can quickly overwhelm small websites.

Proposed solution

  • Detect 429 responses and implement proper backoff logic (reduce the autoscaled pool's concurrency, add a cooldown period, ...); see the sketch after this list.
  • Ensure the autoscaled pool does not interpret slow responses or 429s as a signal to increase concurrency.
  • Consider respecting Retry-After headers if present.
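
A minimal sketch of what such backoff handling could look like, independent of crawlee's actual internals. The names BackoffController and parse_retry_after, as well as the halving strategy, are illustrative assumptions, not existing crawlee APIs:

# Illustrative only: detect 429, honor Retry-After, and shrink the desired
# concurrency so the autoscaled pool stops hammering the server.
from __future__ import annotations

import email.utils
import time


def parse_retry_after(value: str | None, default: float = 10.0) -> float:
    """Return a delay in seconds from a Retry-After header (seconds or HTTP date)."""
    if not value:
        return default
    try:
        return max(float(value), 0.0)
    except ValueError:
        pass
    try:
        parsed = email.utils.parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default
    return max(parsed.timestamp() - time.time(), 0.0)


class BackoffController:
    """Tracks a desired concurrency and a cooldown window after 429 responses."""

    def __init__(self, max_concurrency: int = 10) -> None:
        self.max_concurrency = max_concurrency
        self.desired_concurrency = max_concurrency
        self.cooldown_until = 0.0

    def on_too_many_requests(self, retry_after: str | None = None) -> None:
        # Halve the concurrency (never below 1) and pause new requests for a while.
        self.desired_concurrency = max(1, self.desired_concurrency // 2)
        self.cooldown_until = time.time() + parse_retry_after(retry_after)

    def on_success(self) -> None:
        # Recover gradually instead of letting the pool scale straight back up.
        self.desired_concurrency = min(self.max_concurrency, self.desired_concurrency + 1)

The crawler would call on_too_many_requests() whenever a handler sees a 429 and feed desired_concurrency / cooldown_until into the scaling decision, so that slow 429 responses never register as spare capacity.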

References

  • This was originally discussed on Slack https://apify.slack.com/archives/CD0SF6KD4/p1756993901117969.

vdusek avatar Sep 30 '25 12:09 vdusek

At the moment, the AutoscaledPool only cares about 429s returned by the Apify API. We never tried to prevent clobbering the target website.

janbuchar avatar Sep 30 '25 13:09 janbuchar

Imagine a general scenario where the request queue contains links to many different URLs. You get a 429 from one of those URLs, but the other, unrelated URLs are not affected. Slowing down the whole crawler does not seem like a good idea in such a case. The crawler should probably implement some additional backoff logic at a different level, without the autoscaled pool even noticing it.

That would require tracking 429s per group of URLs and acting on that group (back-off logic, a fixed larger interval between requests to that specific site, ...) without affecting other, unrelated groups of URLs.
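
A minimal illustration of that idea, grouping by hostname; the class and method names are made up for the example, not existing crawlee code:

import time
from urllib.parse import urlparse


class PerHostBackoff:
    """Tracks a back-off window per hostname so unrelated sites are unaffected."""

    def __init__(self) -> None:
        self._next_allowed_at: dict[str, float] = {}

    def record_429(self, url: str, delay: float) -> None:
        host = urlparse(url).hostname or ""
        # Extend (never shorten) the back-off window for this host only.
        self._next_allowed_at[host] = max(
            self._next_allowed_at.get(host, 0.0), time.time() + delay
        )

    def is_allowed(self, url: str) -> bool:
        host = urlparse(url).hostname or ""
        return time.time() >= self._next_allowed_at.get(host, 0.0)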

Pijukatel avatar Oct 03 '25 06:10 Pijukatel

I would also suggest not implementing this in BasicCrawler, but instead creating a dedicated component that would be used only by BasicCrawler, grouping the related functionality there.

Right now, __run_task_function does roughly this:

1 - get request
2 - possibly looks at robots.txt
3 - start processing the request, ...

I would extract step 2 and, together with this new functionality, delegate it to a RequestAnalyzer (placeholder name). What would the responsibilities of RequestAnalyzer be?

  • If desired, decide how the request should be handled based on robots.txt (optional, can be turned on/off).
  • Keep track of intentionally blocked / slowed-down / delayed requests. So, for example, something like this would be tracked internally:
group1 = UrlGroup(match="crawlee.dev")
group1.next_allowed_request_at = datetime(...)
group1.already_retrieved_unhandled_requests = {"https://crawlee.dev/a", "https://crawlee.dev/b", "https://crawlee.dev/c"}
group2 = UrlGroup(match="example.com")
group2.next_allowed_request_at = datetime(...)
group2.already_retrieved_unhandled_requests = {"https://example.com/abc"}
  • RequestAnalyzer would implement a fetch_next_request that picks one of the previously delayed requests if one is available and ready to go, and some sort of add_request; see the sketch below.
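
A speculative sketch of that RequestAnalyzer; UrlGroup, the hostname-based matching, and all method names are placeholders rather than existing crawlee APIs:

from __future__ import annotations

import time
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class UrlGroup:
    match: str                                   # e.g. a hostname such as "crawlee.dev"
    next_allowed_request_at: float = 0.0         # unix timestamp
    already_retrieved_unhandled_requests: list[str] = field(default_factory=list)


class RequestAnalyzer:
    def __init__(self) -> None:
        self._groups: dict[str, UrlGroup] = {}

    def _group_for(self, url: str) -> UrlGroup:
        host = urlparse(url).hostname or ""
        return self._groups.setdefault(host, UrlGroup(match=host))

    def is_allowed_now(self, url: str) -> bool:
        """True if the URL's group is not currently in a back-off window."""
        return time.time() >= self._group_for(url).next_allowed_request_at

    def add_request(self, url: str, delay: float = 0.0) -> None:
        """Park a request; optionally push the group's back-off window further out."""
        group = self._group_for(url)
        if delay:
            group.next_allowed_request_at = max(
                group.next_allowed_request_at, time.time() + delay
            )
        group.already_retrieved_unhandled_requests.append(url)

    def fetch_next_request(self) -> str | None:
        """Return a previously delayed request whose group is ready again, if any."""
        now = time.time()
        for group in self._groups.values():
            if group.already_retrieved_unhandled_requests and now >= group.next_allowed_request_at:
                return group.already_retrieved_unhandled_requests.pop(0)
        return None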

And the new __run_task_function would be something like this:

0 - Check RequestAnalyzer if it has some previously delayed unhandled request that is ready to be handled now
1 - get new request if no previously fetched request is ready from RequestAnalyzer
2 - Use RequestAnalyzer only on the new request to see if it belongs to one of the groups that should be delayed, or not handled at all
3 - start processing the request if RequestAnalyzer is OK with it.
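
Roughly, using the sketch above; again, _request_analyzer, _fetch_new_request_url, and _process_request are invented names, not BasicCrawler's actual internals:

async def _run_task_function(self) -> None:
    # 0 - prefer a previously delayed request that is ready to be handled now.
    url = self._request_analyzer.fetch_next_request()
    if url is None:
        # 1 - otherwise fetch a new request from the queue.
        url = await self._fetch_new_request_url()
        if url is None:
            return
        # 2 - let the analyzer decide whether the new request's group is
        #     currently delayed (or, with robots.txt support on, skipped).
        if not self._request_analyzer.is_allowed_now(url):
            self._request_analyzer.add_request(url)
            return
    # 3 - the analyzer is OK with it, process the request.
    await self._process_request(url)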

Pijukatel avatar Oct 03 '25 06:10 Pijukatel