AutoscaledPool scales down too slowly when the client/API is overloaded
Honestly, I don't know what the expected behavior is. I can see two general approaches:
- Autoscaling just helps slow things down a bit, but doesn't do so quickly, aggressively, or rigorously, and doesn't try to prevent a crash.
- Autoscaling should ensure the actor will not crash on the API rate limit.
Right now, it behaves like the first option. If you are overloading the API's rate limit, you can be almost certain the actor will eventually crash. The autoscaling sometimes slows things down, but it is too weak to do so consistently. I'm not talking about cases where you somehow spawn millions of API requests, but about a normal fast scraping process.
I don't really have a strong opinion on this. I can see how autoscaling that is too "defensive" could hurt the performance of common runs. On the other hand, the error you eventually get is pretty nasty for an inexperienced user.
It would just be nice to know whether we agree on how this should behave.
I monitored it some time ago and the scaling itself works quite well in preventing a sudden crash. A problem arises when the load is very high all the time, but not extremely high, especially when processing a single Request may produce tens or hundreds of API requests, because AutoscaledPool obviously has no way to downscale those; it can only prevent new Requests from being processed.
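Since the pool can only gate whole Requests, the fan-out inside a handler has to be bounded by the handler itself. Here is a minimal sketch of that idea, assuming a plain semaphore; `Semaphore`, `callApiThrottled`, and the limit of 5 are made up for illustration, not Crawlee APIs:

```ts
// Minimal sketch: cap the concurrent API calls spawned inside a single Request
// handler, since AutoscaledPool cannot downscale them once the handler runs.
// `Semaphore` and `callApiThrottled` are hypothetical helpers, not Crawlee APIs.
class Semaphore {
    private waiting: Array<() => void> = [];
    private active = 0;

    constructor(private readonly limit: number) {}

    async acquire(): Promise<void> {
        if (this.active >= this.limit) {
            // Park until a running call releases its slot.
            await new Promise<void>((resolve) => this.waiting.push(resolve));
        }
        this.active++;
    }

    release(): void {
        this.active--;
        this.waiting.shift()?.(); // wake one parked caller, if any
    }
}

const apiSlots = new Semaphore(5); // e.g. at most 5 API calls in flight per process

async function callApiThrottled<T>(doCall: () => Promise<T>): Promise<T> {
    await apiSlots.acquire();
    try {
        return await doCall();
    } finally {
        apiSlots.release();
    }
}
```

With something like this in place, a handler that fires hundreds of calls still keeps only five in flight, so the API sees a bounded load even when the pool itself cannot react.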
AutoscaledPool does not care whether it's seeing a third retry or a seventh. Sometimes a request gets unlucky: even though other requests are being processed fine, or retried only a few times, this one just keeps getting retried until it finally crashes the crawl.
We could improve the overloading algorithm to give more weight to requests that have higher retry counts, and maybe stop any additional tasks from running completely when there is a request that's on its last retry (see the sketch below).
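A rough sketch of the second half of that idea, using the pool's `isTaskReadyFunction` to hold back new tasks while any running request is on its final attempt; the queue shape, `MAX_RETRIES`, and `processRequest` are all hypothetical, not how AutoscaledPool works today:

```ts
import { AutoscaledPool } from 'crawlee';

// Hypothetical bookkeeping; none of this exists in AutoscaledPool today.
interface QueuedRequest { url: string; retryCount: number }

const queue: QueuedRequest[] = [{ url: 'https://example.com', retryCount: 0 }];
const MAX_RETRIES = 3;    // assumed to mirror the crawler's maxRequestRetries
let tasksOnLastRetry = 0; // running tasks that are on their final attempt

async function processRequest(request: QueuedRequest): Promise<void> {
    // ... perform the (possibly many) API calls for this request ...
}

const pool = new AutoscaledPool({
    isFinishedFunction: async () => queue.length === 0,
    // Refuse to start new work while a request is on its last retry,
    // so it gets the remaining rate-limit budget to itself.
    isTaskReadyFunction: async () => queue.length > 0 && tasksOnLastRetry === 0,
    runTaskFunction: async () => {
        const request = queue.shift()!;
        const lastAttempt = request.retryCount >= MAX_RETRIES;
        if (lastAttempt) tasksOnLastRetry++;
        try {
            await processRequest(request);
        } finally {
            if (lastAttempt) tasksOnLastRetry--;
        }
    },
});

await pool.run();
```

The retry-count weighting could plug into the same hook, e.g. by shrinking the number of tasks the ready check allows as the highest in-flight retryCount grows.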
Although I must add that I'm using the Pool for some highly intensive processing tasks that run through tens of thousands of tasks in minutes, and it has never had a problem.
I think my run qualifies as "the load is very high all the time, but not extremely high".
Closing as stale