Excessive RQ writes in `RequestProvider.batchAddRequests`
- https://github.com/apify/crawlee/pull/2456 made it so that for large amounts of added URLs, we only check the `uniqueKey` cache for the first batch (1000 by default)
- with crazy websites such as https://docs.n8n.io/ that have 1250 links on every page (the same exact links), this means that every `enqueueLinks` call will cost 250 unnecessary RQ writes (see the sketch below)
- we might want to make a separate cache only for `uniqueKey`-based deduplication to avoid the problem that the linked PR intended to fix
- cc @vladfrangu
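To make the cost concrete, here is a minimal sketch (using the simplified numbers from above, not the actual crawlee internals): anything past the first batch bypasses the `uniqueKey` cache and is written to the RQ even when it is a duplicate.

```ts
// Simplified model of the behaviour after apify/crawlee#2456:
// only the first BATCH_SIZE requests are checked against the uniqueKey cache.
const BATCH_SIZE = 1000;

function uncheckedWrites(linksPerPage: number): number {
    return Math.max(0, linksPerPage - BATCH_SIZE);
}

// docs.n8n.io: ~1250 identical links per page => 250 unchecked RQ writes
// on every enqueueLinks call
console.log(uncheckedWrites(1250)); // 250
```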
Regarding the separate caching, perhaps something dead simple like the following, using https://www.npmjs.com/package/@node-rs/xxhash, would suffice - we'd get good-enough deduplication without using up all our memory.
```ts
import { xxh32 } from '@node-rs/xxhash';

// Fixed-size slot array; the size here is just an example value.
const cache: (string | undefined)[] = new Array(1_000_000);

function set(uniqueKey: string): void {
    cache[xxh32(Buffer.from(uniqueKey)) % cache.length] = uniqueKey;
}

function has(uniqueKey: string): boolean {
    return cache[xxh32(Buffer.from(uniqueKey)) % cache.length] === uniqueKey;
}
```
Related to https://github.com/apify/apify-sdk-python/issues/514