`maxRequestsPerCrawl` with RQ optimizations drops requests
When `maxRequestsPerCrawl` is set, the crawler doesn't enqueue requests over this limit, which saves RQ writes.
The limit check doesn't take RQ deduplication into account, though:
- `maxRequestsPerCrawl` is e.g. `10`
- The first crawled page contains 10 instances of the same link (e.g. to itself) and one new link (e.g. `/new`)
- Crawlee enqueues `links.slice(0, 10)` (a rough sketch of this logic follows the list)
- This is based on a wrong assumption: after enqueuing those links, the RQ contents won't change, because the RQ drops all of them as duplicates / already handled
- The last link (the actual new link) is never enqueued
- After processing the first request, the crawler finishes, as the RQ doesn't have any unhandled requests
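To make the failure mode concrete, here is a minimal sketch of the assumed slicing behaviour (illustrative only, not the actual Crawlee source; the URLs mirror the reproduction below):

```ts
// Assumption: the maxRequestsPerCrawl budget is applied to the raw link list
// *before* the request queue deduplicates, so a unique link past the cut-off
// is silently dropped.
const maxRequestsPerCrawl = 10;

// Links extracted from the first page: 10 self-links and one new link.
const extractedLinks: string[] = [
    ...Array.from({ length: 10 }, () => 'http://localhost:3000/'),
    'http://localhost:3000/2',
];

// Hypothetical budget slice, taken before deduplication (per the report above):
const toEnqueue = extractedLinks.slice(0, maxRequestsPerCrawl); // '/2' is cut off

// The request queue then deduplicates; '/' is already handled, so nothing new
// is added and the crawl finishes after the first request.
const newRequests = [...new Set(toEnqueue)].filter((url) => url !== 'http://localhost:3000/');
console.log(newRequests); // []
```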
Example
```js
import { CheerioCrawler } from '@crawlee/cheerio';
import http from 'http';

(async () => {
    const server = http.createServer((req, res) => {
        res.writeHead(200, { 'Content-Type': 'text/html' });
        res.end(`
            <!DOCTYPE html>
            <html lang="en">
            <body>
                <ul>
                    ${Array.from({ length: 10 }, (_, i) => `<li><a href="/">Link ${i + 1}</a></li>`).join('')}
                </ul>
                <ul>
                    <li><a href="/2">Link 11</a></li>
                </ul>
            </body>
            </html>
        `);
    });
    server.listen(3000);

    const crawler = new CheerioCrawler({
        requestHandler: async ({ request, enqueueLinks, log }) => {
            log.info(`Processing ${request.url}...`);
            await enqueueLinks();
        },
        maxRequestsPerCrawl: 10,
    });

    await crawler.run(['http://localhost:3000']);
    server.close();
})();
```
Observed behaviour
(as described above)
```
INFO CheerioCrawler: Finished! Total 1 requests: 1 succeeded, 0 failed. {"terminal":true}
```
Expected behaviour
(crawling both / and /2)
```
INFO CheerioCrawler: Finished! Total 2 requests: 2 succeeded, 0 failed. {"terminal":true}
```
Observed when debugging the Generic Actors E2E tests in SDK (unrelated PR here).
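For anyone hitting this before a fix lands, a possible user-side mitigation is to deduplicate the extracted links in the request handler and pass them to `enqueueLinks` via its `urls` option, so duplicates can no longer push a unique link past the cut-off. This is only a sketch under the assumption above (that the slice is applied to the raw, undeduplicated link list), not an official workaround:

```ts
import { CheerioCrawler } from '@crawlee/cheerio';

const crawler = new CheerioCrawler({
    requestHandler: async ({ request, enqueueLinks, log, $ }) => {
        log.info(`Processing ${request.url}...`);

        // Collect hrefs ourselves and deduplicate them before they reach the
        // maxRequestsPerCrawl slice, so a unique link can't be cut off by the
        // duplicates that precede it.
        const uniqueUrls = new Set<string>();
        $('a[href]').each((_, el) => {
            const href = $(el).attr('href');
            if (href) uniqueUrls.add(new URL(href, request.loadedUrl ?? request.url).href);
        });

        await enqueueLinks({ urls: [...uniqueUrls] });
    },
    maxRequestsPerCrawl: 10,
});

await crawler.run(['http://localhost:3000']);
```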