
How to scrape large websites in a reasonable manner

Open · benoit74 opened this issue 7 months ago · 0 comments

Scraping large websites (millions of pages) is challenging because:

  • since the scrape takes a long time to complete, there is a significant chance that the website changes during the crawl:
    • this can cause small issues, like some pages being missing or outdated compared to the rest of the corpus
    • this can cause more serious issues, like broken links due to some pages being moved during the crawl
  • since the scrape takes a long time to complete, it is complex to run on the Zimfarm

One example of such a website is https://forums.gentoo.org/, where it looks like we have between 1 and 6 million pages to crawl. See https://github.com/openzim/zim-requests/issues/1057

Most pages are however static, i.e. they rarely change from one crawl to the next, so some caching could definitely help, but I have no idea how we could implement this.
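One possible direction, just as a rough illustration of the caching idea and not a proposal for how zimit or Browsertrix actually work internally, would be to use HTTP conditional requests so that unchanged pages are not re-downloaded between crawls. The sketch below is a minimal, hypothetical helper (`fetch_with_cache`, a local `crawl-cache` directory, the `requests` library) and assumes the origin server returns usable `ETag` / `Last-Modified` headers, which not all forum software does.

```python
# Minimal sketch: reuse a cached body when the server reports 304 Not Modified.
# All names and the cache layout here are hypothetical.
import hashlib
import json
import pathlib

import requests

CACHE_DIR = pathlib.Path("crawl-cache")
CACHE_DIR.mkdir(exist_ok=True)


def fetch_with_cache(url: str) -> bytes:
    """Fetch url, reusing the cached body when the server says it is unchanged."""
    key = CACHE_DIR / (hashlib.sha256(url.encode()).hexdigest() + ".json")
    cached = json.loads(key.read_text()) if key.exists() else None

    # Send conditional headers if we have validators from a previous crawl.
    headers = {}
    if cached:
        if cached.get("etag"):
            headers["If-None-Match"] = cached["etag"]
        if cached.get("last_modified"):
            headers["If-Modified-Since"] = cached["last_modified"]

    resp = requests.get(url, headers=headers, timeout=30)
    if resp.status_code == 304 and cached:
        # Page did not change since the last crawl: reuse the stored body.
        return bytes.fromhex(cached["body_hex"])

    resp.raise_for_status()
    key.write_text(json.dumps({
        "etag": resp.headers.get("ETag"),
        "last_modified": resp.headers.get("Last-Modified"),
        "body_hex": resp.content.hex(),
    }))
    return resp.content
```

Even if validators are missing, a similar scheme could hash page bodies and skip re-processing unchanged ones; either way the hard part remains integrating this with the browser-based crawl and the WARC output.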

For now, I don't know how we could crawl such big sites in a reasonable manner.

benoit74 · Jul 01 '24 12:07