Dataset API - socket

Open petrpatek opened this issue 6 years ago • 5 comments

While working on TripAdvisor, I discovered that the dataset API gets overloaded when I send, for example, 10 requests at a time with higher concurrency. I was wondering whether it would be possible to use a WebSocket or something similar to push larger amounts of data at once and speed up the process. Since I am using cheerio with custom requests, this is my final bottleneck. The process is still pretty fast; I am just curious whether this is something we are considering.

petrpatek avatar Feb 21 '19 10:02 petrpatek

Technically, the AutoscaledPool should not allow the overloading to happen and downscale accordingly when any API is overloaded. So if you're simply pushing to dataset once per request, the pool should scale down to match the slower pipe, which seems to be our API.

@mtrunkat @jancurn Can we improve the speed of dataset API?

mnmkng avatar Feb 21 '19 12:02 mnmkng

IMHO we can reduce the rate limiting for this API endpoint

jancurn avatar Feb 21 '19 13:02 jancurn

The rate limit is high. @petrpatek are you pushing the data to the dataset in batches (tens or hundreds of items) or item by item?

mtrunkat avatar Feb 21 '19 13:02 mtrunkat

@mtrunkat Yes, in batches of hundreds. I limited the concurrency of pending promises to 10, but that still seems to be a lot when the autoscaled pool concurrency is also 10. I ended up limiting the pending promises to 3 and it is OK now.
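
For readers following along, capping the number of pending pushes could be sketched roughly as below. This is not the code used in the issue: `runWithLimit` and its `task` callback are hypothetical names, and in an actor the task would presumably wrap a dataset push.

```javascript
// Hypothetical helper (not part of the Apify SDK): run `task(item)` for
// every item while keeping at most `limit` promises pending at once.
async function runWithLimit(items, limit, task) {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError("limit must be a positive integer");
  }
  const results = new Array(items.length);
  let next = 0; // index of the next item nobody has picked up yet

  // Each worker pulls items one by one until the list is exhausted.
  async function worker() {
    while (next < items.length) {
      const index = next++;
      results[index] = await task(items[index]);
    }
  }

  // Start `limit` workers (or fewer if there are not enough items).
  const workerCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results; // same order as `items`
}
```

With a limit of 3, a fourth push only starts once one of the first three has settled, which matches the setting that worked here.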

petrpatek avatar Feb 21 '19 14:02 petrpatek

How does it work? What are you getting from the scrapes and what are you pushing? You can use .pushData([item, item, item]) to push them in a single call instead of calling e.g. items.forEach(item => Apify.pushData(item)).

Would that help?
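
A minimal sketch of the batching idea, assuming the push function accepts an array as described above. `chunk` and `pushInBatches` are hypothetical helpers, and `pushData` is passed in as a parameter so the sketch runs without the SDK; in an actor it would be something like `(batch) => Apify.pushData(batch)`. The default batch size of 100 is only an illustration.

```javascript
// Hypothetical helper: split an array into chunks of at most `size` items.
function chunk(items, size) {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const chunks = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// Push all items with one call per batch instead of one call per item.
// Batches are sent one after another, so only one push is pending at a time.
async function pushInBatches(items, pushData, batchSize = 100) {
  let calls = 0;
  for (const batch of chunk(items, batchSize)) {
    await pushData(batch);
    calls += 1;
  }
  return calls; // number of push calls actually made
}
```

For 250 scraped items this makes 3 push calls instead of 250, which is the kind of reduction the suggestion is aiming for.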

mnmkng avatar Feb 21 '19 15:02 mnmkng

Closing as stale.

B4nan avatar Jul 17 '23 15:07 B4nan