headless-chrome-crawler
Stuck for no reason?
I have a list of about 1.8k URLs, but when I call
await crawler.queue(urls)
it seems to get stuck at random, with no timeout.
const fs = require('fs')
const _ = require('lodash')
const writeJsonFile = require('write-json-file')
const HCCrawler = require('headless-chrome-crawler')
const RedisCache = require('headless-chrome-crawler/cache/redis')

const cache = new RedisCache({ host: '127.0.0.1', port: 6379 })

// getUrls() is defined elsewhere and returns the ~1.8k URLs
let urls = getUrls()
let count = urls.length

async function p1() {
  const crawler = await HCCrawler.launch({
    cache,
    persistCache: true,
    evaluatePage: (() => ({
      title: $('#litZQMC').text(),
      html: $('#divScroll').html()
    })),
    onSuccess: async resp => {
      const { result: { title, html } } = resp
      // skip pages whose output file already exists
      if (fs.existsSync(`files/${title}.txt`)) {
        console.log('skip', count--, title)
      } else {
        await writeJsonFile(`files/${title}.txt`, html)
        console.log('done', count--, title)
      }
    },
    onError: err => {
      console.log(err)
    }
  })
  await crawler.queue(urls)
  await crawler.onIdle()
  await crawler.close()
}

async function queue() {
  await p1()
}

queue()
- Version: 1.8.0
- Platform / OS version: osx
- Node.js version: v8.11.3
I have the same situation, though not randomly: it just gets stuck, with the Chrome process killed after several minutes.
Did anyone find a solution or workaround?
No exception is thrown and no error is printed. I still have a few chrome processes running when the script gets stuck.
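If it helps to narrow down which URL the crawler was handling when it stalled, a minimal logging sketch along these lines might be useful. It assumes the request lifecycle events listed in the headless-chrome-crawler README ('requeststarted', 'requestfinished', 'requestfailed'); the exact event payload shapes below are a guess.

const HCCrawler = require('headless-chrome-crawler');

(async () => {
  const crawler = await HCCrawler.launch({
    evaluatePage: () => ({ title: $('title').text() }),
    onSuccess: resp => console.log('done', resp.result.title),
  });

  // The last 'started' URL without a matching 'finished'/'failed' entry
  // is the request the crawler was working on when it got stuck.
  crawler.on('requeststarted', options => console.log('started', options.url));
  crawler.on('requestfinished', options => console.log('finished', options.url));
  crawler.on('requestfailed', error => console.error('failed', error));

  await crawler.queue(['https://example.com/']);
  await crawler.onIdle();
  await crawler.close();
})();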
I discovered what causes the blocking in my case. It happens when a tab is pointed (via _page.goto()) at a page containing Flash. The browser then shows a warning dialog that is not detected by _handleDialog() in crawler.js, which causes an infinite delay in _collectLinks().
Solution (works for me): the first part of _collectLinks() needs to be changed to:
/**
 * @param {!string} baseUrl
 * @return {!Promise<!Array<!string>>}
 * @private
 */
async _collectLinks(baseUrl) {
  const links = [];
  await Promise.race([
    // Give up after 10 seconds so a blocked page cannot hang the crawl forever
    new Promise(resolve => {
      setTimeout(resolve, 10000);
    }),
    this._page.exposeFunction('pushToLinks', link => {
      const _link = resolveUrl(link, baseUrl);
      if (_link) links.push(_link);
    })
  ]);
  console.log("PASSED");
  // ...the rest of _collectLinks() is unchanged
This modification possibly causes a memory leak, but it works for me.
Maybe @yujiosaka could look more into this, since it's clearly an easy-to-reproduce and easy-to-fix bug.
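Regarding the memory-leak concern: a small refinement, sketched below as a standalone helper rather than the library's actual code, would clear the timer once the race settles so it does not outlive the call (the withTimeout name is purely illustrative).

// Illustrative helper: race a promise against a timeout that resolves with
// undefined, and clear the timer afterwards (compatible with Node 8, which
// lacks Promise.prototype.finally).
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise(resolve => {
    timer = setTimeout(resolve, ms);
  });
  const clear = () => clearTimeout(timer);
  return Promise.race([promise, timeout]).then(
    value => { clear(); return value; },
    error => { clear(); throw error; }
  );
}

// Inside _collectLinks() the change would then read roughly:
//   await withTimeout(this._page.exposeFunction('pushToLinks', ...), 10000);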
Try adding args: ['--no-sandbox'] to the crawler options.
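For reference, a minimal sketch of where that option goes, assuming HCCrawler.launch forwards Puppeteer launch options such as args:

const HCCrawler = require('headless-chrome-crawler');

(async () => {
  const crawler = await HCCrawler.launch({
    // Puppeteer launch flag suggested above
    args: ['--no-sandbox'],
    evaluatePage: () => ({ title: $('title').text() }),
    onSuccess: resp => console.log(resp.result.title),
  });

  await crawler.queue(['https://example.com/']);
  await crawler.onIdle();
  await crawler.close();
})();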
Is anyone willing to make a PR?