headless-chrome-crawler
Duplicated URLs are crawled twice
What is the current behavior?
Duplicated URLs are not skipped; the same URL is crawled twice.
If the current behavior is a bug, please provide the steps to reproduce
const HCCrawler = require('./lib/hccrawler');

(async () => {
  const crawler = await HCCrawler.launch({
    evaluatePage: () => ({
      title: document.title,
    }),
    onSuccess: (result) => {
      console.log(result);
    },
    skipDuplicates: true,
    jQuery: false,
    maxDepth: 3,
    args: ['--no-sandbox'],
  });
  await crawler.queue([{
    url: 'https://www.example.com/',
  }, {
    url: 'https://www.example.com/',
  }]);
  await crawler.onIdle();
  await crawler.close();
})();
What is the expected behavior?
Already-crawled URLs should be skipped, even when they come from the queue.
Please tell us about your environment:
- Version: latest
- Platform / OS version: CentOS 7.1
- Node.js version: v8.4.0
The reason might lie in helper.js:
static generateKey(options) {
  const json = JSON.stringify(pick(options, PICKED_OPTION_FIELDS), Helper.jsonStableReplacer);
  return Helper.hash(json).substring(0, MAX_KEY_LENGTH);
}
Uniqueness is assessed from a hash generated on the result of JSON.stringify(), but this method doesn't guarantee a constant key order.
I'm looking for opinions. See https://github.com/substack/json-stable-stringify
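For illustration, here is a minimal sketch of the kind of order-independent key the linked json-stable-stringify approach would produce. The function names and the key length are my own choices, not the library's code:

const crypto = require('crypto');

// Recursively sort object keys so the serialized form does not depend on
// property insertion order (json-stable-stringify does this more robustly).
function stableStringify(value) {
  if (Array.isArray(value)) return `[${value.map(stableStringify).join(',')}]`;
  if (value && typeof value === 'object') {
    const keys = Object.keys(value).sort();
    return `{${keys.map((k) => `${JSON.stringify(k)}:${stableStringify(value[k])}`).join(',')}}`;
  }
  return JSON.stringify(value);
}

// Hypothetical drop-in for generateKey(): hash the order-independent string.
function generateStableKey(options) {
  return crypto.createHash('md5').update(stableStringify(options)).digest('hex').substring(0, 10);
}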
Same as #299. @yujiosaka should look into this.
I keep getting 302 responses in headless mode.
I found two reasons:
- With maxConcurrency > 1, the same page can be requested in parallel threads.
- A page that redirected is deduplicated by its source URL, not its target. You can skip these URLs by setting skipRequestedRedirect: true.
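As a quick check of both points, a launch configuration along these lines should avoid the duplicates. This is only a sketch; whether skipRequestedRedirect is available depends on the installed version:

const HCCrawler = require('headless-chrome-crawler');

(async () => {
  const crawler = await HCCrawler.launch({
    maxConcurrency: 1,           // avoid the parallel-request race described above
    skipDuplicates: true,
    skipRequestedRedirect: true, // also skip URLs already requested via a redirect
    onSuccess: (result) => console.log(result),
  });
  await crawler.queue(['https://www.example.com/', 'https://www.example.com/']);
  await crawler.onIdle();
  await crawler.close();
})();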
Is anyone considering creating a PR?
Just posting here hoping this helps someone. It is true that duplicate URLs get crawled when concurrency > 1. Here is what I did (a rough sketch follows the list):
- First, create a SQLite database.
- Then, in the RequestStarted event, insert the current URL.
- In the preRequest function (you can pass this function along with the options object), check whether there is already a record for the current URL. If there is, the URL has either been crawled or is still being crawled, so return false; that skips the URL.
- In the RequestRetried and RequestFailed events, delete the URL, so the crawler can try it again.
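A rough sketch of that approach, assuming better-sqlite3 for synchronous lookups; the database file, table name, and exact event payloads are my assumptions, not part of the comment above:

const HCCrawler = require('headless-chrome-crawler');
const Database = require('better-sqlite3');

(async () => {
  // SQLite table recording every URL the crawler has started on.
  const db = new Database('crawl-state.db');
  db.exec('CREATE TABLE IF NOT EXISTS requested (url TEXT PRIMARY KEY)');

  const crawler = await HCCrawler.launch({
    maxConcurrency: 5,
    // Skip any URL that is already recorded (crawled or still crawling).
    preRequest: (options) => !db.prepare('SELECT 1 FROM requested WHERE url = ?').get(options.url),
    onSuccess: (result) => console.log(result),
  });

  // Record the URL as soon as a request starts
  // (assuming the event payload is the request options with a url field).
  crawler.on('requeststarted', (options) => {
    db.prepare('INSERT OR IGNORE INTO requested (url) VALUES (?)').run(options.url);
  });

  // Forget the URL on retry or failure so the crawler can try it again.
  const forget = (options) => db.prepare('DELETE FROM requested WHERE url = ?').run(options.url);
  crawler.on('requestretried', forget);
  crawler.on('requestfailed', forget);

  await crawler.queue('https://www.example.com/');
  await crawler.onIdle();
  await crawler.close();
})();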