AutoscaledPool scales down too slowly when the client/API is overloaded
Honestly, I don't know what the expected behavior is. I can see two general approaches:
- Autoscaling just helps slow things down a bit, but doesn't do so quickly, aggressively, or rigorously, and doesn't try to prevent a crash.
- Autoscaling should ensure the actor will not crash on the API rate limit.
Right now, it behaves like the first option. If you are overloading the API's rate limit, you can be almost certain the actor will eventually crash. The autoscaling sometimes slows things down, but it is too weak to do so consistently. I'm not talking about cases where you somehow spawn millions of API requests, but about a normal fast scraping process.
I don't really have a strong opinion on this. I can see how autoscaling that is too "defensive" could hurt the performance of common runs. On the other hand, the error you eventually get is pretty nasty for an inexperienced user.
It would just be nice to know whether we agree on how this should behave.
I monitored it some time ago and the scaling itself works quite well in preventing a sudden crash. A problem arises when the load is very high all the time, but not extremely high, especially when processing a single Request may produce tens or hundreds of API requests, because AutoscaledPool obviously has no way to downscale those; it can only prevent new Requests from being processed.
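Since the pool can only gate whole Requests, the fan-out inside a handler has to be bounded by the handler itself. Here is a minimal sketch of that idea, assuming a plain semaphore; `Semaphore`, `callApiThrottled`, and the limit of 5 are made up for illustration, not Crawlee APIs:

```ts
// Minimal sketch: cap the concurrent API calls spawned inside a single Request
// handler, since AutoscaledPool cannot downscale them once the handler runs.
// `Semaphore` and `callApiThrottled` are hypothetical helpers, not Crawlee APIs.
class Semaphore {
    private waiting: Array<() => void> = [];
    private active = 0;

    constructor(private readonly limit: number) {}

    async acquire(): Promise<void> {
        if (this.active >= this.limit) {
            // Park until a running call releases its slot.
            await new Promise<void>((resolve) => this.waiting.push(resolve));
        }
        this.active++;
    }

    release(): void {
        this.active--;
        this.waiting.shift()?.(); // wake one parked caller, if any
    }
}

const apiSlots = new Semaphore(5); // e.g. at most 5 API calls in flight per process

async function callApiThrottled<T>(doCall: () => Promise<T>): Promise<T> {
    await apiSlots.acquire();
    try {
        return await doCall();
    } finally {
        apiSlots.release();
    }
}
```

With something like this in place, a handler that fires hundreds of calls still keeps only five in flight, so the API sees a bounded load even when the pool itself cannot react.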
AutoscaledPool does not care whether it's seeing a third retry or a seventh. Sometimes a request gets unlucky: even though other requests are being processed fine, or retried only a few times, this one just keeps getting retried until it finally crashes the crawl.
We could improve the overloading algorithm to give more weight to requests that have higher retry counts, and maybe stop any additional tasks from running completely when there is a request that's on its last retry (see the sketch below).
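A rough sketch of the second half of that idea, using the pool's `isTaskReadyFunction` to hold back new tasks while any running request is on its final attempt; the queue shape, `MAX_RETRIES`, and `processRequest` are all hypothetical, not how AutoscaledPool works today:

```ts
import { AutoscaledPool } from 'crawlee';

// Hypothetical bookkeeping; none of this exists in AutoscaledPool today.
interface QueuedRequest { url: string; retryCount: number }

const queue: QueuedRequest[] = [{ url: 'https://example.com', retryCount: 0 }];
const MAX_RETRIES = 3;    // assumed to mirror the crawler's maxRequestRetries
let tasksOnLastRetry = 0; // running tasks that are on their final attempt

async function processRequest(request: QueuedRequest): Promise<void> {
    // ... perform the (possibly many) API calls for this request ...
}

const pool = new AutoscaledPool({
    isFinishedFunction: async () => queue.length === 0,
    // Refuse to start new work while a request is on its last retry,
    // so it gets the remaining rate-limit budget to itself.
    isTaskReadyFunction: async () => queue.length > 0 && tasksOnLastRetry === 0,
    runTaskFunction: async () => {
        const request = queue.shift()!;
        const lastAttempt = request.retryCount >= MAX_RETRIES;
        if (lastAttempt) tasksOnLastRetry++;
        try {
            await processRequest(request);
        } finally {
            if (lastAttempt) tasksOnLastRetry--;
        }
    },
});

await pool.run();
```

The retry-count weighting could plug into the same hook, e.g. by shrinking the number of tasks the ready check allows as the highest in-flight retryCount grows.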
Although I must add that I'm using the Pool for some highly intensive processing tasks that run through tens of thousands of tasks in minutes, and it has never had a problem.
I think my run qualifies as "the load is very high all the time, but not extremely high".
Closing as stale