athome-scraper
Parallelize fetch calls
Great article! At the end you mention how slow the requests are; JavaScript makes parallelizing requests trivial with Promise.all. This change should do the job. The only caveat is that I haven't used Puppeteer much, but I don't think there will be an issue with passing around a single browser instance and making parallel calls to browser.newPage().
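Roughly, the shape of the change looks like this (a sketch only, in my words rather than your code; scrapeLink stands in for whatever per-link extraction the article actually does):

const puppeteer = require('puppeteer');

// Hypothetical per-link worker: one tab per link, closed when done.
async function scrapeLink(browser, link) {
  const page = await browser.newPage();
  try {
    await page.goto(link);
    return await page.content(); // swap in your real extraction here
  } finally {
    await page.close();
  }
}

async function scrapeAll(links) {
  const browser = await puppeteer.launch();
  try {
    // Promise.all kicks off every request immediately and waits for all of them.
    return await Promise.all(links.map(link => scrapeLink(browser, link)));
  } finally {
    await browser.close();
  }
}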
Thank you Joe! Will try that out as soon as I get home.
By the way, I've read about Promise.all, and the one thing that concerns me is that it will spawn n parallel requests to the website, which could consume a lot of memory since we're talking about spawning a new tab with Puppeteer (Chrome :sigh:) for each link. Do you know how I could use Promise.all but cap it at m parallel requests at a time?
Yeah, that's the one concern I would have with Puppeteer. Promise.all doesn't natively support any kind of pooling or concurrency limit, but there are some libraries and solutions here: https://stackoverflow.com/questions/40639432/what-is-the-best-way-to-limit-concurrency-when-using-es6s-promise-all
p-limit, from the top-rated answer, looks like a good, simple choice!
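With p-limit it would look something like this (reusing the hypothetical scrapeLink from my sketch above; the cap of 2 is an arbitrary example):

const pLimit = require('p-limit');

const limit = pLimit(2); // at most 2 tabs open at any moment

async function scrapeAll(browser, links) {
  // Each call is wrapped by the limiter, so only 2 run concurrently;
  // Promise.all still collects all the results in order.
  return Promise.all(links.map(link => limit(() => scrapeLink(browser, link))));
}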
Another option: since the data is embedded in the HTML, you don't really need Puppeteer at all. You could just make a plain HTTP request from Node and parse the HTML to pull out the data. axios and node-fetch are popular promise-based libraries; it would look something like this:
const axios = require('axios');

async function scrape_json_data(link) {
  // axios resolves with a response object; the HTML body is on `.data`.
  const { data: html } = await axios.get(link);
  const json = extractDataFromHtml(html);
  return json;
}
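And since plain HTTP requests are cheap compared to Chrome tabs, you could then fan out with an unbounded Promise.all (scrape_all_links is just an illustrative wrapper):

async function scrape_all_links(links) {
  // No browser tabs involved, so unlimited concurrency is usually fine here.
  return Promise.all(links.map(scrape_json_data));
}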
That way you wouldn't need to worry about memory consumption or limiting concurrency!
By the way, there's no obligation to merge this PR! I only wanted to show how you could parallelize those slow requests, so feel free to close it if you prefer :)