python-scraperlib
Add a scraper check utility
Currently, we rely on various objects in scraperlib to:
- create the ZIM
- re-encode videos and images
- cache these assets on the optimization cache
We might consider adding a mechanism to perform sanity checks on scraper behavior:
- did we cache all re-encoded images/videos when an optimization cache is configured?
- did we remove temporary files from the filesystem once they were added to the ZIM? (while we prefer in-memory/streaming approaches, many scrapers still use the temporary-file approach, and some situations even require it)
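As a rough illustration of what such checks might look like, here is a minimal sketch. The function names, and the idea of comparing a set of re-encoded asset keys against cache contents, are assumptions for illustration, not existing scraperlib API:

```python
from pathlib import Path


def check_assets_cached(reencoded: set[str], cached: set[str]) -> set[str]:
    """Return re-encoded asset keys that are missing from the optimization cache.

    An empty result means every re-encoded image/video was cached.
    """
    return reencoded - cached


def check_tempfiles_removed(tmp_dir: Path) -> list[Path]:
    """Return temporary files still on disk.

    By the end of a run (or after each item is added to the ZIM),
    this list should be empty.
    """
    return sorted(p for p in tmp_dir.rglob("*") if p.is_file())
```

A scraper (or scraperlib itself) could call these at the end of a run and log or raise on non-empty results.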
What I do not yet know:
- should we make the scraper fail if these checks fail?
- could we automate these checks? (i.e. require no scraper modifications, or as few as possible; at the very least, not make a call to some "check_i_m_ok" helper mandatory, since scraper developers might forget it as well. I have doubts here, because there are many kinds of situations to cover)
- can we run these checks early, so that the scraper fails as soon as possible instead of wasting time and resources?
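One way the last two questions could combine is a context manager that scraperlib wires in once, tracks temporary files as they are created, verifies removal immediately after each ZIM addition (fail early), and does a final sweep on exit. Everything below (class name, method names, the tracking protocol) is a hypothetical sketch, not an existing or proposed API surface:

```python
import contextlib
from pathlib import Path


class ScraperChecks(contextlib.AbstractContextManager):
    """Hypothetical sketch: collect expectations during the run and verify
    them incrementally and on exit, raising so the scraper fails fast."""

    def __init__(self, fail_fast: bool = True):
        self.fail_fast = fail_fast
        self.pending_tempfiles: set[Path] = set()

    def track_tempfile(self, path: Path) -> None:
        """Record a temporary file the scraper created."""
        self.pending_tempfiles.add(Path(path))

    def mark_added_to_zim(self, path: Path) -> None:
        """Called once the item backed by `path` is in the ZIM;
        the temp file should be gone by now."""
        path = Path(path)
        self.pending_tempfiles.discard(path)
        if self.fail_fast and path.exists():
            raise RuntimeError(f"temp file not removed after ZIM addition: {path}")

    def __exit__(self, exc_type, exc, tb):
        # Final sweep: anything tracked but never marked as added is suspicious.
        leftover = {p for p in self.pending_tempfiles if p.exists()}
        if leftover and exc_type is None:
            raise RuntimeError(f"temp files never added to the ZIM: {leftover}")
        return False
```

The incremental `mark_added_to_zim` check is what makes early failure possible; the `__exit__` sweep is the safety net for anything the incremental hooks missed.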