
Excessive RQ writes in `RequestProvider.batchAddRequests`

Open • janbuchar opened this issue 5 months ago • 2 comments

  • https://github.com/apify/crawlee/pull/2456 made it so that for large numbers of added URLs, only the first batch (1000 by default) is checked against the uniqueKey cache (see the sketch below this list)
  • on link-heavy websites such as https://docs.n8n.io/, where every page contains the same 1250 links, this means that every enqueueLinks call costs 250 unnecessary RQ writes
  • we might want to introduce a separate cache used only for uniqueKey-based deduplication, without reintroducing the problem that the linked PR was meant to fix
  • cc @vladfrangu
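
For illustration, a minimal sketch (hypothetical names, not crawlee's actual internals) of the behavior described above:

const BATCH_SIZE = 1000
const seenKeys = new Set<string>() // the existing uniqueKey cache

// Stand-in for the actual request queue API call.
declare function writeToRequestQueue(requests: { uniqueKey: string }[]): Promise<void>

async function batchAddRequests(requests: { uniqueKey: string }[]): Promise<void> {
  // Only the first batch is filtered against the cache...
  const firstBatch = requests
    .slice(0, BATCH_SIZE)
    .filter((request) => !seenKeys.has(request.uniqueKey))
  for (const request of firstBatch) seenKeys.add(request.uniqueKey)
  // ...while the remainder bypasses it entirely: with 1250 identical links on
  // every page, 250 duplicates are written to the RQ on each enqueueLinks call.
  const rest = requests.slice(BATCH_SIZE)
  await writeToRequestQueue([...firstBatch, ...rest])
}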

janbuchar • Aug 07 '25 14:08

Regarding the separate caching, perhaps something dead simple like this using https://www.npmjs.com/package/@node-rs/xxhash would suffice: we'd get good-enough deduplication without using up all our memory.

import { xxh32 } from '@node-rs/xxhash'

// Fixed-size, lossy cache: a colliding hash simply evicts the previous key.
// Sized arbitrarily here; tune to trade memory for collision rate.
const cache = new Array<string | undefined>(1 << 20)

function set(uniqueKey: string): void {
  cache[xxh32(Buffer.from(uniqueKey)) % cache.length] = uniqueKey
}

function has(uniqueKey: string): boolean {
  return cache[xxh32(Buffer.from(uniqueKey)) % cache.length] === uniqueKey
}
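
A false negative (a key evicted by a colliding hash) only costs one redundant RQ write, and since the RQ presumably deduplicates by uniqueKey anyway, correctness is unaffected; the array holds a fixed number of slots, so memory stays bounded no matter how long the crawl runs. A quick usage sketch (the shouldWriteToQueue name is made up):

function shouldWriteToQueue(uniqueKey: string): boolean {
  if (has(uniqueKey)) return false // almost certainly seen before, skip the write
  set(uniqueKey)
  return true // unseen (or evicted by a collision); the write may be redundant but is safe
}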

janbuchar • Aug 07 '25 15:08

Related to https://github.com/apify/apify-sdk-python/issues/514

janbuchar • Aug 07 '25 15:08