`maxRequestsPerCrawl` with RQ optimizations drops requests

Open barjin opened this issue 4 months ago • 0 comments

When maxRequestsPerCrawl is set, the crawler doesn't enqueue more requests than this limit allows. This saves RQ (request queue) writes.

However, the limit doesn't account for RQ deduplication:

  • maxRequestsPerCrawl is set to e.g. 10
  • The first crawled page contains 10 instances of the same link (e.g. to itself) and one new link (e.g. /new)
  • Crawlee enqueues only links.slice(0, 10).
    • This relies on the wrong assumption that every sliced link becomes a new request. In fact, enqueuing those links doesn't change the RQ contents (RQ drops all of them as duplicates / already handled).
    • The last link (the only actually new one) is never enqueued.
  • After processing the first request, the crawler finishes, because the RQ has no unhandled requests left.
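
The steps above can be reproduced without Crawlee in a few lines. This is only a rough model under stated assumptions: `simulateEnqueue`, the `seen` set and the `dedupeFirst` flag are hypothetical names made up for illustration, and the "deduplicate before slicing" variant is just one conceivable ordering, not necessarily how Crawlee does or should handle it.

```javascript
// Hypothetical model of a deduplicating request queue; NOT Crawlee's real implementation.
// `seen` stands in for the URLs the queue already knows (handled or pending).
function simulateEnqueue(links, seen, budget, { dedupeFirst }) {
    // Optionally drop repeated and already-known links BEFORE applying the budget.
    const candidates = dedupeFirst
        ? [...new Set(links)].filter((url) => !seen.has(url))
        : links;

    const added = [];
    for (const url of candidates.slice(0, budget)) {
        // The queue always deduplicates on write: known URLs are silently skipped.
        if (!seen.has(url)) {
            seen.add(url);
            added.push(url);
        }
    }
    return added;
}

// Ten links back to the already-handled page, followed by one new link.
const links = [...Array(10).fill('/'), '/new'];

// Slice first, deduplicate later: the budget is spent on duplicates.
const naive = simulateEnqueue(links, new Set(['/']), 10, { dedupeFirst: false });

// Deduplicate first, slice later: the new link survives.
const fixed = simulateEnqueue(links, new Set(['/']), 10, { dedupeFirst: true });

console.log(naive); // []
console.log(fixed); // [ '/new' ]
```

In the first call the whole budget of 10 is consumed by links the queue then discards, so nothing new is added and the crawl ends after one request.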

Example

import { CheerioCrawler } from '@crawlee/cheerio';
import http from 'http';

(async () => {
    // Local test server: every path returns the same page with
    // ten links to "/" (duplicates) followed by a single link to "/2".
    const server = http.createServer((req, res) => {
        res.writeHead(200, { 'Content-Type': 'text/html' });
        res.end(`
            <!DOCTYPE html>
            <html lang="en">
            <body>
                <ul>
                    ${Array.from({ length: 10 }, (_, i) => `<li><a href="/">Link ${i + 1}</a></li>`).join('')}
                </ul>
                <ul>
                    <li><a href="/2">Link 11</a></li>
                </ul>
            </body>
            </html>
        `);
    });

    server.listen(3000);

    const crawler = new CheerioCrawler({
        requestHandler: async ({ request, enqueueLinks, log }) => {
            log.info(`Processing ${request.url}...`);
            await enqueueLinks();
        },
        // The limit is far above the two unique URLs, yet "/2" is never crawled.
        maxRequestsPerCrawl: 10,
    });

    await crawler.run(['http://localhost:3000']);

    server.close();
})();

Observed behaviour

(as described above)

INFO  CheerioCrawler: Finished! Total 1 requests: 1 succeeded, 0 failed. {"terminal":true}

Expected behaviour

(crawling both / and /2)

INFO  CheerioCrawler: Finished! Total 2 requests: 2 succeeded, 0 failed. {"terminal":true}

Observed when debugging the Generic Actors E2E tests in SDK (unrelated PR here).

barjin · Sep 04 '25 12:09