node-scraper
Parallel scraping results in misses & duplicates
What an awesome scraper platform! Got all geared up in no time.
However, while I found single-page scraping to work just fine, parallel scraping with many URLs (I had 79) fails, producing missed URLs and duplicates even though the total number of fetched URLs is correct.
I suspect the queuing implementation is the cause. I tried a small fix in scraper.js that produced the results I was hoping for.
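For illustration only (this is not the patch mentioned above, which wasn't shared): the symptom described, a correct total count but with duplicates and misses, is what you get when more than one worker can claim the same URL. A minimal sketch of a duplicate-safe, concurrency-limited queue in plain Node.js follows; the `fetchPage` helper and the concurrency limit of 5 are assumptions, not part of node-scraper's API.

```js
const https = require('https');

// Hypothetical per-URL fetch helper (assumes https URLs);
// stands in for whatever scrape call you actually use.
function fetchPage(url) {
  return new Promise((resolve, reject) => {
    https.get(url, (res) => {
      let body = '';
      res.on('data', (chunk) => { body += chunk; });
      res.on('end', () => resolve({ url, body }));
    }).on('error', reject);
  });
}

// Each URL is removed from the queue exactly once, so no worker can
// fetch a duplicate and none can be skipped, while `concurrency`
// caps how many requests are in flight at a time.
async function scrapeAll(urls, concurrency = 5) {
  const queue = urls.slice();
  const results = [];
  async function worker() {
    while (queue.length > 0) {
      const url = queue.shift(); // no await between check and shift
      results.push(await fetchPage(url));
    }
  }
  await Promise.all(Array.from({ length: concurrency }, () => worker()));
  return results;
}
```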
I faced this problem too, with just 20 URLs rather than 79.
Is there a way to enforce a timeout?
From what I remember, a timeout wouldn't help; I think I tried that arrangement. I'm not a JS guru, so I'm not dead sure what a rock-solid fix would be, but mine worked for me at least :)
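For reference, if you did want to enforce a per-request timeout, a generic way is to race the request against a timer. As noted above, this only guards against hung requests and would not repair a queuing bug. A sketch, reusing the hypothetical `fetchPage` helper from earlier:

```js
// Wrap any promise with a hard timeout via Promise.race.
// Note this abandons the request rather than cancelling it.
function withTimeout(promise, ms) {
  const timeout = new Promise((_, reject) =>
    setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms)
  );
  return Promise.race([promise, timeout]);
}

// Usage: await withTimeout(fetchPage(url), 10000);
```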
(Whoops, Comment & Close was kinda too close.)
Try using cheerio instead of jsdom and implementing your own queuing; that worked for me!
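A rough sketch of that suggestion: since node-scraper handles the parsing internally with jsdom, switching to cheerio means fetching and parsing the HTML yourself, which also lets you plug in a queue like the one sketched above. Requires `npm install cheerio`; the `$('title')` selector is just an example.

```js
const cheerio = require('cheerio');

// Fetch all pages through the custom queue, then parse each
// body with cheerio instead of jsdom.
async function scrapeTitles(urls) {
  const pages = await scrapeAll(urls, 5); // queue sketched above
  return pages.map(({ url, body }) => {
    const $ = cheerio.load(body);
    return { url, title: $('title').text() };
  });
}
```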