Set desiredConcurrency based on type of crawler and available memory

Open metalwarrior665 opened this issue 2 years ago • 2 comments

Which package is the feature request for? If unsure which one to select, leave blank

None

Feature

Unless you are running with very low memory, you usually want scrapers to start with some concurrency already, to speed up the initial part of the scrape. This matters even more for very short runs, where the crawler might not get a chance to scale up at all.

We can make a conservative mapping of the desiredConcurrency based on available memory.

Motivation

As above

Ideal solution or implementation, and any additional constraints

Example solution here: https://github.com/apify-projects/store-crawler-google-places/blob/master/src/utils/misc-utils.ts#L485

We probably don't want to silently reduce maxConcurrency like in the example.
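To make the idea concrete, here is a rough sketch of what a conservative mapping could look like. The helper name `desiredConcurrencyFromMemory` and the default of one slot per 2048 MB are illustrative assumptions, not existing Crawlee behavior. The user's limits are only read, never modified, so maxConcurrency is not silently reduced.

```typescript
// Hypothetical helper, not part of Crawlee.
interface ConcurrencyLimits {
    minConcurrency: number;
    maxConcurrency: number;
}

function desiredConcurrencyFromMemory(
    memoryMbytes: number,
    limits: ConcurrencyLimits,
    mbytesPerSlot = 2048, // assumed ratio, would need tuning and testing
): number {
    // Conservative estimate: one concurrent request per `mbytesPerSlot` of memory.
    const fromMemory = Math.floor(memoryMbytes / mbytesPerSlot);
    // Clamp the starting value into the user's range without touching the limits themselves.
    return Math.min(limits.maxConcurrency, Math.max(limits.minConcurrency, fromMemory));
}
```

The returned number would only be a starting point (for example passed as `autoscaledPoolOptions.desiredConcurrency`); the autoscaled pool could still scale up or down from there.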

Alternative solutions or implementations

No response

Other context

No response

metalwarrior665, Feb 14 '23 12:02

We already do that based on the crawler type: cheerio sets the default to 10 (#1428).

Note that the solution you proposed seems quite tied to the Apify platform. We should be careful with that: Crawlee needs to work out of the box in other environments too, and I am not sure how common our memory/CPU ratio ("4 GB = 1 CPU") is elsewhere.

But maybe it will work just fine; we just need to test it carefully. It looks like the proposal sets the concurrency to half the memory in GB, which is much less aggressive than the static 10 we have now for cheerio, but on the other hand it could be too aggressive for a browser crawler.

B4nan, Feb 14 '23 12:02

The code is more of an example than a proposal, since it was tailored to Google Maps, which is quite heavy. We can make it scale a bit higher with Cheerio and a bit lower with browsers to be on the safe side.
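Something along these lines, as a rough sketch only. The `CrawlerKind` type, the `initialConcurrency` helper, and the slots-per-GB ratios are all made-up placeholders to show the shape of the idea; the real numbers would have to come from testing.

```typescript
// Hypothetical names and ratios, for illustration only.
type CrawlerKind = 'http' | 'browser';

// Assumed starting slots per GB of memory: HTTP crawlers are light, browsers are heavy.
const SLOTS_PER_GB: Record<CrawlerKind, number> = {
    http: 2,
    browser: 0.5,
};

function initialConcurrency(kind: CrawlerKind, memoryMbytes: number): number {
    const slots = Math.floor((memoryMbytes / 1024) * SLOTS_PER_GB[kind]);
    // Never start below 1, even with very low memory.
    return Math.max(1, slots);
}
```

With these placeholder ratios, 4 GB would start an HTTP crawler at 8 and a browser crawler at 2.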

I guess we are not able to measure CPU allocation (from inside a local Docker container and such), so memory would serve as an approximation. On most cloud platforms, the two should scale up more or less together.

@mnmkng can probably chime in since he was setting up the initial auto-scaling.

metalwarrior665, Feb 14 '23 18:02