Dataset API - socket
While working on TripAdvisor, I discovered that the Dataset API gets overloaded when I send, for example, 10 requests at a time with higher concurrency. So I was wondering whether it would be possible to use a WebSocket or something similar to push larger amounts of data at once and speed up the process. Since I am using Cheerio with custom requests, this is my final bottleneck. The process is still pretty fast; I am just curious if this is something that we are considering.
Technically, the AutoscaledPool should not allow the overloading to happen and should downscale accordingly when any API is overloaded. So if you're simply pushing to the dataset once per request, the pool should scale down to match the slower pipe, which in this case seems to be our API.
@mtrunkat @jancurn Can we improve the speed of the Dataset API?
IMHO we could relax the rate limiting for this API endpoint.
The rate limit is high. @petrpatek, are you pushing the data to the dataset in batches (tens or hundreds of items) or item by item?
@mtrunkat Yes, in batches of hundreds. I limit the number of pending promises to 10, but that still seems to be a lot when the autoscaled pool concurrency is 10. I ended up limiting the pending promises to 3 and it is OK now.
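For reference, a minimal sketch of what that throttling could look like; `MAX_PENDING` and `pushBatch` are illustrative names only, not part of the SDK:

```javascript
const Apify = require('apify');

// Illustrative only: cap the number of pushData calls in flight at once.
const MAX_PENDING = 3;
let pending = [];

async function pushBatch(batch) {
    // Wait for a free slot if too many pushes are already in flight.
    while (pending.length >= MAX_PENDING) {
        await Promise.race(pending);
    }
    const promise = Apify.pushData(batch).finally(() => {
        // Remove this push from the in-flight list once it settles.
        pending = pending.filter((p) => p !== promise);
    });
    pending.push(promise);
    return promise;
}
```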
How does it work? What are you getting from the scrapes and what are you pushing? You can use `.pushData([item, item, item])` to push them in a single call instead of calling e.g. `items.forEach(item => Apify.pushData(item))`. Would that help?
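A rough sketch of that suggestion in a CheerioCrawler handler; the URL, selectors, and field names are made up for illustration:

```javascript
const Apify = require('apify');

Apify.main(async () => {
    const requestList = await Apify.openRequestList('start', [
        { url: 'https://example.com' },
    ]);

    const crawler = new Apify.CheerioCrawler({
        requestList,
        handlePageFunction: async ({ $, request }) => {
            // Collect everything scraped from the page first...
            const items = $('.result')
                .map((i, el) => ({
                    url: request.url,
                    title: $(el).find('.title').text().trim(),
                }))
                .get();

            // ...then push the whole batch in a single API call,
            await Apify.pushData(items);

            // instead of one call per item:
            // items.forEach((item) => Apify.pushData(item));
        },
    });

    await crawler.run();
});
```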
Closing as stale.