headless-chrome-crawler
Queueing same url on multiple workers in cluster with Redis cache results in duplicates
What is the current behavior?
When using a Redis cache for the queue and a cluster of processes crawling, the crawler repeats requests.
If the current behavior is a bug, please provide the steps to reproduce
Create a cluster in which each worker process starts crawling the same URL (on a crawler using a Redis cache).
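A rough repro sketch of that setup, assuming a local Redis instance and a placeholder target URL:

```js
const cluster = require('cluster');
const HCCrawler = require('headless-chrome-crawler');
const RedisCache = require('headless-chrome-crawler/cache/redis');

if (cluster.isMaster) {
  // Fork a few workers; each one launches its own crawler.
  for (let i = 0; i < 4; i++) cluster.fork();
} else {
  HCCrawler.launch({
    // All workers share the same Redis-backed cache.
    cache: new RedisCache({ host: '127.0.0.1', port: 6379 }),
    persistCache: true,
    onSuccess: result => console.log(process.pid, result.options.url),
  }).then(async crawler => {
    // Every worker queues the same URL.
    await crawler.queue('https://example.com/');
    await crawler.onIdle();
    await crawler.close();
  });
}
```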
What is the expected behavior?
Even if the same URL is added multiple times, I would expect there to be no duplicates. Should this be the case?
Please tell us about your environment:
- Version: 1.8.0
- Platform / OS version: Windows
- Node.js version: 8.11.3
I have the same problem; I can confirm it.
Have you tried enabling the skipDuplicates and skipRequestedRedirect options in the queue options?
I believe the current behavior is that it crawls duplicate URLs because it treats them as different request/response pairs, but if you enable those options it shouldn't make duplicate requests anymore. Please confirm whether this fixes your problem.
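For reference, a minimal sketch of enabling both options, assuming a local Redis instance (the URL is a placeholder):

```js
const HCCrawler = require('headless-chrome-crawler');
const RedisCache = require('headless-chrome-crawler/cache/redis');

HCCrawler.launch({
  cache: new RedisCache({ host: '127.0.0.1', port: 6379 }),
  persistCache: true, // keep cache entries so workers can share them
  onSuccess: result => console.log(result.options.url),
}).then(async crawler => {
  await crawler.queue({
    url: 'https://example.com/',
    skipDuplicates: true,        // skip URLs that were already requested
    skipRequestedRedirect: true, // also skip URLs seen in redirect chains
  });
  await crawler.onIdle();
  await crawler.close();
});
```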
@BubuAnabelas I set it up with skipDuplicates and skipRequestedRedirect, but the issue is still reproducible for me.
I have a feeling it might be because of differing 'extraHeaders'?
Any guidance here would be appreciated; I'm a Redis noob, and I just want to make my crawler more efficient so it doesn't hit the same pages once per worker.
Just posting here in the hope that it helps someone. It's true that the crawler fetches duplicate URLs when concurrency > 1, so here is what I did (a rough sketch follows the list).
- First, create an SQLite database.
- Then, in the RequestStarted event handler, insert the current URL.
- In the preRequest function (you can pass this function along with the options object), check whether there is already a record for the current URL. If there is, the URL has either been crawled or is still being crawled, so return false and the request will be skipped.
- In the RequestRetried and RequestFailed event handlers, delete the URL, so the crawler can try it again.
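A rough sketch of that approach, assuming the better-sqlite3 package and a local Redis cache; the exact event payload shapes should be double-checked against the library docs:

```js
const HCCrawler = require('headless-chrome-crawler');
const RedisCache = require('headless-chrome-crawler/cache/redis');
const Database = require('better-sqlite3');

const db = new Database('crawled.db');
db.exec('CREATE TABLE IF NOT EXISTS urls (url TEXT PRIMARY KEY)');
const insertUrl = db.prepare('INSERT OR IGNORE INTO urls (url) VALUES (?)');
const findUrl = db.prepare('SELECT 1 FROM urls WHERE url = ?');
const deleteUrl = db.prepare('DELETE FROM urls WHERE url = ?');

HCCrawler.launch({
  cache: new RedisCache({ host: '127.0.0.1', port: 6379 }),
  persistCache: true,
  // Skip the request if another worker already started on this URL.
  preRequest: options => !findUrl.get(options.url),
  onSuccess: result => console.log(result.options.url),
}).then(async crawler => {
  // Record the URL as soon as its request starts.
  crawler.on('requeststarted', options => insertUrl.run(options.url));
  // Remove the record on retry/failure so the URL can be attempted again.
  crawler.on('requestretried', options => deleteUrl.run(options.url));
  crawler.on('requestfailed', error => {
    // The failed event receives an Error; the original options may be
    // attached to it (check the docs for the exact payload shape).
    if (error && error.options) deleteUrl.run(error.options.url);
  });

  await crawler.queue('https://example.com/');
  await crawler.onIdle();
  await crawler.close();
});
```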