python-scraperlib
Add a scraper check utility
Currently, we rely on various objects in scraperlib to:
- create the ZIM
- re-encode videos and images
- cache these assets on the optimization cache
We might consider adding a mechanism to perform sanity checks on scraper behavior:
- did we cache all re-encoded images/videos when an optimization cache is configured?
- did we remove temporary files from the filesystem once they were added to the ZIM? (while we prefer in-memory/streaming approaches, many scrapers still use the temporary-file approach, and some situations even require it)
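As a rough illustration of what such checks might look like, here is a minimal sketch. The function names, and the idea of comparing a set of re-encoded asset keys against cache contents, are assumptions for illustration, not existing scraperlib API:

```python
from pathlib import Path


def check_assets_cached(reencoded: set[str], cached: set[str]) -> set[str]:
    """Return re-encoded asset keys that are missing from the optimization cache.

    An empty result means every re-encoded image/video was cached.
    """
    return reencoded - cached


def check_tempfiles_removed(tmp_dir: Path) -> list[Path]:
    """Return temporary files still on disk.

    By the end of a run (or after each item is added to the ZIM),
    this list should be empty.
    """
    return sorted(p for p in tmp_dir.rglob("*") if p.is_file())
```

A scraper (or scraperlib itself) could call these at the end of a run and log or raise on non-empty results.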
What I do not yet know:
- should we make the scraper fail if these checks fail?
- could we automate these checks? (i.e. require no scraper modifications, or as few as possible; at the very least, not make a call to some "check_i_m_ok" helper mandatory, since scraper developers might forget it as well. I have doubts here, because there are many kinds of situations to cover)
- can we run these checks early, so that the scraper fails as soon as possible instead of wasting time and resources?
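One way the last two questions could combine is a context manager that scraperlib wires in once, tracks temporary files as they are created, verifies removal immediately after each ZIM addition (fail early), and does a final sweep on exit. Everything below (class name, method names, the tracking protocol) is a hypothetical sketch, not an existing or proposed API surface:

```python
import contextlib
from pathlib import Path


class ScraperChecks(contextlib.AbstractContextManager):
    """Hypothetical sketch: collect expectations during the run and verify
    them incrementally and on exit, raising so the scraper fails fast."""

    def __init__(self, fail_fast: bool = True):
        self.fail_fast = fail_fast
        self.pending_tempfiles: set[Path] = set()

    def track_tempfile(self, path: Path) -> None:
        """Record a temporary file the scraper created."""
        self.pending_tempfiles.add(Path(path))

    def mark_added_to_zim(self, path: Path) -> None:
        """Called once the item backed by `path` is in the ZIM;
        the temp file should be gone by now."""
        path = Path(path)
        self.pending_tempfiles.discard(path)
        if self.fail_fast and path.exists():
            raise RuntimeError(f"temp file not removed after ZIM addition: {path}")

    def __exit__(self, exc_type, exc, tb):
        # Final sweep: anything tracked but never marked as added is suspicious.
        leftover = {p for p in self.pending_tempfiles if p.exists()}
        if leftover and exc_type is None:
            raise RuntimeError(f"temp files never added to the ZIM: {leftover}")
        return False
```

The incremental `mark_added_to_zim` check is what makes early failure possible; the `__exit__` sweep is the safety net for anything the incremental hooks missed.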