
Parallelize fetch calls

Open helloitsjoe opened this issue 3 years ago • 3 comments

Great article! At the end you mention how slow the requests are. JavaScript makes parallelizing requests trivial with Promise.all. This change should do the job - the only caveat is that I haven't used Puppeteer much, but I don't think there will be an issue with passing a single browser instance and making parallel calls to browser.newPage().
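Roughly, it could look like this (just a sketch, untested - scrapeLink stands in for whatever per-page extraction you already do):

const puppeteer = require('puppeteer');

// One shared browser, one tab per link, all pages fetched in parallel.
async function scrapeAll(links) {
  const browser = await puppeteer.launch();
  const results = await Promise.all(
    links.map(async (link) => {
      const page = await browser.newPage();
      try {
        await page.goto(link);
        return await scrapeLink(page); // stand-in for the existing per-page logic
      } finally {
        await page.close(); // always close the tab, even if scraping throws
      }
    })
  );
  await browser.close();
  return results;
}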

helloitsjoe avatar Apr 09 '22 19:04 helloitsjoe

Thank you Joe! Will try that out as soon as I get home.

By the way, I've read about Promise.all, and the only thing that concerns me is that it will spawn n parallel requests to the website, which could consume a lot of memory since we're talking about spawning a new tab with Puppeteer (Chrome :sigh:) for each link. Do you know how I could use Promise.all but change its behaviour to make at most m parallel requests at a time?

mattrighetti avatar Apr 09 '22 21:04 mattrighetti

Yeah, that's the one concern I would have with puppeteer. Promise.all doesn't natively support any kind of pooling or concurrency limit, but there are some libraries and solutions here: https://stackoverflow.com/questions/40639432/what-is-the-best-way-to-limit-concurrency-when-using-es6s-promise-all

p-limit in the top-rated answer looks like a good, simple choice!
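With p-limit it could be as small as this (a sketch - recent p-limit versions are ESM-only, hence the import, and scrapeLink is the same hypothetical per-link function as above):

import pLimit from 'p-limit';

// At most 3 requests in flight at any time; the number is illustrative.
const limit = pLimit(3);

// Each call is queued; p-limit starts it only when a slot frees up.
const results = await Promise.all(
  links.map((link) => limit(() => scrapeLink(link)))
);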

Another option: since the data is embedded in the HTML, you don't really need Puppeteer at all - you could just make a request with Node and parse the HTML to pull out the data. axios and node-fetch are popular promise-based HTTP libraries; it would look something like this:

const axios = require('axios');

async function scrape_json_data(link) {
  // axios resolves to a response object; the HTML string is on `.data`
  const { data: html } = await axios.get(link);
  // extractDataFromHtml: your existing logic for pulling the embedded data out of the page
  return extractDataFromHtml(html);
}

That way you wouldn't need to worry about memory consumption and limiting concurrency!
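For what it's worth, extractDataFromHtml could be something along these lines (pure speculation about the page structure - I'm assuming the data sits in a JSON <script> tag and using cheerio to find it):

const cheerio = require('cheerio');

// Hypothetical: assumes the page embeds its data as JSON in a <script> tag;
// adjust the selector to match the actual markup.
function extractDataFromHtml(html) {
  const $ = cheerio.load(html);
  const raw = $('script[type="application/json"]').first().html();
  return JSON.parse(raw);
}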

helloitsjoe avatar Apr 09 '22 22:04 helloitsjoe

By the way, no obligation to merge this PR! I only wanted to mention how you could parallelize those slow requests, but feel free to close the PR if you prefer :)

helloitsjoe avatar Apr 10 '22 01:04 helloitsjoe