How to scrape large websites in a reasonable manner
Scraping a large website (millions of pages) is challenging because:
- since the scrape takes a long time to complete, the chance that the website changes during the crawl is significant:
  - this can cause small issues, like some pages being missing or outdated compared to the rest of the corpus
  - this can cause more serious issues, like broken links due to some pages being moved during the crawl
- since the scrape takes a long time to complete, it is complex to run on the Zimfarm
One example of such a website is https://forums.gentoo.org/, where it looks like we have between 1 and 6 million pages to crawl. See https://github.com/openzim/zim-requests/issues/1057
Most pages are however static, i.e. they rarely change from one crawl to the next, so some caching could definitely help, but I have no idea how we could implement this.
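
To make that caching idea slightly more concrete, here is a rough sketch of what an incremental re-crawl could look like, based on conditional HTTP requests and content hashes. This is purely illustrative: nothing like this exists in zimit or warc2zim today, and the index file name and helper functions below are hypothetical.

```python
import hashlib
import json
from pathlib import Path

import requests

# Hypothetical index persisted between crawls: {url: {"etag", "last_modified", "sha256"}}
INDEX_FILE = Path("crawl-cache-index.json")


def load_index() -> dict:
    """Load the per-URL metadata recorded by the previous crawl (empty on first run)."""
    if INDEX_FILE.exists():
        return json.loads(INDEX_FILE.read_text())
    return {}


def save_index(index: dict) -> None:
    """Persist the per-URL metadata for the next crawl."""
    INDEX_FILE.write_text(json.dumps(index))


def fetch_if_changed(url: str, index: dict, session: requests.Session) -> bytes | None:
    """Fetch `url`, returning None when the previously captured copy can be reused."""
    cached = index.get(url, {})
    headers = {}
    if cached.get("etag"):
        headers["If-None-Match"] = cached["etag"]
    if cached.get("last_modified"):
        headers["If-Modified-Since"] = cached["last_modified"]

    resp = session.get(url, headers=headers, timeout=30)
    if resp.status_code == 304:
        # The server confirms the page did not change since the last crawl.
        return None

    body = resp.content
    digest = hashlib.sha256(body).hexdigest()
    if digest == cached.get("sha256"):
        # Some servers ignore conditional headers, so fall back to comparing content hashes.
        return None

    index[url] = {
        "etag": resp.headers.get("ETag"),
        "last_modified": resp.headers.get("Last-Modified"),
        "sha256": digest,
    }
    return body
```

In practice zimit drives a real browser through browsertrix-crawler rather than a plain HTTP client, so this kind of logic would presumably have to live in the recording/proxy layer, and pages whose HTML changes only because of timestamps or tokens would defeat the hash comparison; the snippet is only meant to show the general idea.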
For now, I don't know how we can crawl such big sites in a reasonable manner.