
Parallel scraping results in misses & duplicates

Open deggis opened this issue 13 years ago • 4 comments

What an awesome scraper platform! Got all geared up in no time.

However, while I found single-page scraping to work just fine, parallel scraping with many URLs (I had 79) fails, resulting in missed URLs and duplicates, even though the total count of fetched URLs is correct.

I suspect the cause is the queuing implementation. I tried a small fix in scraper.js that produced the results I was hoping for.

deggis avatar Jun 13 '11 21:06 deggis

I faced this problem too, with just 20 URLs rather than 79.

Is there a way to enforce a timeout?

gaara87 avatar Jun 27 '11 12:06 gaara87

From what I remember, a timeout wouldn't help; I think I tried that arrangement. I'm not a JS guru, so I'm not dead sure what a rock-solid fix for this would look like, but mine worked for me at least :)

deggis avatar Jun 28 '11 16:06 deggis

(Whoops, Comment & Close was kinda too close.)

deggis avatar Jun 28 '11 16:06 deggis

Try using cheerio instead of jsdom and implementing your own queuing; it worked for me!
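
For anyone landing here later: a minimal sketch of what "your own queuing" could look like. This is not the actual scraper.js fix from this thread; it's a hypothetical concurrency-limited queue where each URL is claimed by index exactly once, which rules out the misses and duplicates described above. `runQueue` and `worker` are illustrative names, and the worker is a stand-in for the real request + cheerio parsing step.

```javascript
// Hypothetical sketch: hand each URL to at most `concurrency` in-flight
// workers, claiming each index exactly once (no misses, no duplicates).
function runQueue(urls, concurrency, worker, done) {
  const results = new Array(urls.length); // one slot per URL
  let next = 0;      // index of the next URL to hand out
  let pending = 0;   // workers currently in flight
  let finished = 0;  // URLs completed

  function launch() {
    // Start workers until the limit is reached or the list is exhausted.
    while (pending < concurrency && next < urls.length) {
      const i = next++;  // claim an index; JS is single-threaded, so this is race-free
      pending++;
      worker(urls[i], function (err, result) {
        results[i] = err || result;
        pending--;
        finished++;
        if (finished === urls.length) return done(results);
        launch(); // refill the worker pool
      });
    }
  }
  launch();
}
```

Usage would be something like `runQueue(urls, 5, scrapeOne, printResults)`, where `scrapeOne(url, cb)` fetches the page, runs it through cheerio, and calls `cb(err, data)` once.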

nickewansmith avatar Jun 13 '13 17:06 nickewansmith