headless-chrome-crawler
Queueing same url on multiple workers in cluster with Redis cache results in duplicates
What is the current behavior?
When using a Redis cache for the queue and a cluster of processes crawling, the crawler repeats requests.
If the current behavior is a bug, please provide the steps to reproduce
Create a cluster in which each worker process starts crawling the same URL (on a crawler using a Redis cache).
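A rough repro sketch of that setup, assuming a local Redis instance and a placeholder target URL:

```js
const cluster = require('cluster');
const HCCrawler = require('headless-chrome-crawler');
const RedisCache = require('headless-chrome-crawler/cache/redis');

if (cluster.isMaster) {
  // Fork a few workers; each one launches its own crawler.
  for (let i = 0; i < 4; i++) cluster.fork();
} else {
  HCCrawler.launch({
    // All workers share the same Redis-backed cache.
    cache: new RedisCache({ host: '127.0.0.1', port: 6379 }),
    persistCache: true,
    onSuccess: result => console.log(process.pid, result.options.url),
  }).then(async crawler => {
    // Every worker queues the same URL.
    await crawler.queue('https://example.com/');
    await crawler.onIdle();
    await crawler.close();
  });
}
```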
What is the expected behavior?
Even if the same URL is added multiple times, I would expect there to be no duplicates. Should this be the case?
Please tell us about your environment:
- Version: 1.8.0
- Platform / OS version: Windows
- Node.js version: 8.11.3
I have the same problem; I can confirm it.
Have you tried enabling the skipDuplicates and skipRequestedRedirect options in the queue options?
I believe the current behavior is that it crawls duplicate URLs because it treats them as different request/response pairs, but if you enable those options it shouldn't make duplicate requests anymore. Please confirm whether this fixes your problem.
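For reference, a minimal sketch of enabling both options, assuming a local Redis instance (the URL is a placeholder):

```js
const HCCrawler = require('headless-chrome-crawler');
const RedisCache = require('headless-chrome-crawler/cache/redis');

HCCrawler.launch({
  cache: new RedisCache({ host: '127.0.0.1', port: 6379 }),
  persistCache: true, // keep cache entries so workers can share them
  onSuccess: result => console.log(result.options.url),
}).then(async crawler => {
  await crawler.queue({
    url: 'https://example.com/',
    skipDuplicates: true,        // skip URLs that were already requested
    skipRequestedRedirect: true, // also skip URLs seen in redirect chains
  });
  await crawler.onIdle();
  await crawler.close();
});
```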
@BubuAnabelas I set it up with skipDuplicates and skipRequestedRedirect, but the issue is still reproducible for me.
I have a feeling it might be because of differing 'extraHeaders'?
Any guidance here would be appreciated; I'm a Redis noob, and I just want to make my crawler more efficient so it doesn't hit the same pages once per worker.
Just posting here in the hope that it helps someone. It's true that the crawler fetches duplicate URLs when concurrency > 1, so here is what I did (a rough sketch follows the list).
- First, create an SQLite database.
- Then, in the RequestStarted event handler, insert the current URL.
- In the preRequest function (you can pass this function along with the options object), check whether there is already a record for the current URL. If there is, the URL has either been crawled or is still being crawled, so return false and the request will be skipped.
- In the RequestRetried and RequestFailed event handlers, delete the URL, so the crawler can try it again.
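A rough sketch of that approach, assuming the better-sqlite3 package and a local Redis cache; the exact event payload shapes should be double-checked against the library docs:

```js
const HCCrawler = require('headless-chrome-crawler');
const RedisCache = require('headless-chrome-crawler/cache/redis');
const Database = require('better-sqlite3');

const db = new Database('crawled.db');
db.exec('CREATE TABLE IF NOT EXISTS urls (url TEXT PRIMARY KEY)');
const insertUrl = db.prepare('INSERT OR IGNORE INTO urls (url) VALUES (?)');
const findUrl = db.prepare('SELECT 1 FROM urls WHERE url = ?');
const deleteUrl = db.prepare('DELETE FROM urls WHERE url = ?');

HCCrawler.launch({
  cache: new RedisCache({ host: '127.0.0.1', port: 6379 }),
  persistCache: true,
  // Skip the request if another worker already started on this URL.
  preRequest: options => !findUrl.get(options.url),
  onSuccess: result => console.log(result.options.url),
}).then(async crawler => {
  // Record the URL as soon as its request starts.
  crawler.on('requeststarted', options => insertUrl.run(options.url));
  // Remove the record on retry/failure so the URL can be attempted again.
  crawler.on('requestretried', options => deleteUrl.run(options.url));
  crawler.on('requestfailed', error => {
    // The failed event receives an Error; the original options may be
    // attached to it (check the docs for the exact payload shape).
    if (error && error.options) deleteUrl.run(error.options.url);
  });

  await crawler.queue('https://example.com/');
  await crawler.onIdle();
  await crawler.close();
});
```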