How to scrape large websites in a reasonable manner
Scraping a large website (millions of pages) is challenging because:
- since the scrape takes a long time to complete, the chance that the website changes during the crawl is significant:
  - this can cause small issues, like some pages being missing or outdated compared to the rest of the corpus
  - this can cause more serious issues, like broken links due to some pages being moved during the crawl
- since the scrape takes a long time to complete, it is complex to run on the Zimfarm
One example of such a website is https://forums.gentoo.org/, where it looks like we have between 1 and 6 million pages to crawl. See https://github.com/openzim/zim-requests/issues/1057
Most pages are however static, i.e. they rarely change from one crawl to the next, so some caching could definitely help, but I have no idea how we could implement this.
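
To make that caching idea slightly more concrete, here is a rough sketch of what an incremental re-crawl could look like, based on conditional HTTP requests and content hashes. This is purely illustrative: nothing like this exists in zimit or warc2zim today, and the index file name and helper functions below are hypothetical.

```python
import hashlib
import json
from pathlib import Path

import requests

# Hypothetical index persisted between crawls: {url: {"etag", "last_modified", "sha256"}}
INDEX_FILE = Path("crawl-cache-index.json")


def load_index() -> dict:
    """Load the per-URL metadata recorded by the previous crawl (empty on first run)."""
    if INDEX_FILE.exists():
        return json.loads(INDEX_FILE.read_text())
    return {}


def save_index(index: dict) -> None:
    """Persist the per-URL metadata for the next crawl."""
    INDEX_FILE.write_text(json.dumps(index))


def fetch_if_changed(url: str, index: dict, session: requests.Session) -> bytes | None:
    """Fetch `url`, returning None when the previously captured copy can be reused."""
    cached = index.get(url, {})
    headers = {}
    if cached.get("etag"):
        headers["If-None-Match"] = cached["etag"]
    if cached.get("last_modified"):
        headers["If-Modified-Since"] = cached["last_modified"]

    resp = session.get(url, headers=headers, timeout=30)
    if resp.status_code == 304:
        # The server confirms the page did not change since the last crawl.
        return None

    body = resp.content
    digest = hashlib.sha256(body).hexdigest()
    if digest == cached.get("sha256"):
        # Some servers ignore conditional headers, so fall back to comparing content hashes.
        return None

    index[url] = {
        "etag": resp.headers.get("ETag"),
        "last_modified": resp.headers.get("Last-Modified"),
        "sha256": digest,
    }
    return body
```

In practice zimit drives a real browser through browsertrix-crawler rather than a plain HTTP client, so this kind of logic would presumably have to live in the recording/proxy layer, and pages whose HTML changes only because of timestamps or tokens would defeat the hash comparison; the snippet is only meant to show the general idea.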
For now, I don't know how we can crawl such big sites in a reasonable manner.