python-scraperlib

Add a scraper check utility

Open • benoit74 opened this issue 1 year ago • 0 comments

Currently, we rely on various objects in scraperlib to:

  • create the ZIM
  • re-encode videos and images
  • cache these assets in the optimization cache

We might consider adding a mechanism to perform sanity checks on scraper behavior:

  • did we cache all re-encoded images / videos when a cache is present?
  • did we remove temporary files from the filesystem once they were added to the ZIM? (we know that while we prefer in-memory/streaming approaches, many scrapers still use the temporary file approach, and some situations even have to rely on it)
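
To make the two checks above concrete, here is a minimal sketch of what such a utility could track. Everything in it is hypothetical: `ScraperCheck` and its methods are illustrative names, not part of scraperlib's API, and the sketch assumes the scraper (or the library on its behalf) reports each temporary file and each re-encoded asset as it goes.

```python
from __future__ import annotations

import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class ScraperCheck:
    """Hypothetical tracker; NOT an existing scraperlib class."""

    cache_enabled: bool = False
    temp_files: set[Path] = field(default_factory=set)
    reencoded: set[str] = field(default_factory=set)
    cached: set[str] = field(default_factory=set)

    def track_temp_file(self, path: Path) -> None:
        # Remember a temporary file that must be gone by the end of the run.
        self.temp_files.add(Path(path))

    def track_reencoded(self, key: str) -> None:
        # Remember an asset that went through re-encoding.
        self.reencoded.add(key)

    def track_cached(self, key: str) -> None:
        # Remember an asset that was uploaded to the optimization cache.
        self.cached.add(key)

    def problems(self) -> list[str]:
        """Return one human-readable message per failed sanity check."""
        issues = []
        if self.cache_enabled:
            for key in sorted(self.reencoded - self.cached):
                issues.append(f"re-encoded asset not cached: {key}")
        for path in sorted(self.temp_files):
            if path.exists():
                issues.append(f"temporary file not removed: {path}")
        return issues


def demo() -> list[str]:
    # One leftover temp file and one re-encoded video that never reached the cache.
    with tempfile.TemporaryDirectory() as tmp:
        leftover = Path(tmp) / "video.webm"
        leftover.write_bytes(b"fake")
        check = ScraperCheck(cache_enabled=True)
        check.track_temp_file(leftover)
        check.track_reencoded("videos/intro.webm")
        check.track_reencoded("images/logo.webp")
        check.track_cached("images/logo.webp")
        return check.problems()
```

Calling `demo()` yields two messages, one per check. Whether the tracking calls live in the scrapers or inside the library objects that already do the re-encoding and caching is exactly the open design question.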

What I do not yet know:

  • should we make the scraper fail if these checks fail?
  • is there any chance we can automate these checks? (i.e. with no need to modify the scrapers, or as little as possible; at least without making a call to "check_i_m_ok" mandatory, because scraper developers might forget about it as well. I doubt it, because there are many kinds of situations)
  • can we do these checks early? (so that we fail the scraper asap instead of wasting time and resources)
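
One possible way to explore the first two questions is sketched below, for the temporary-file check only. It assumes all temporary files live under a single known directory, which may not hold for every scraper. `checked_run`, `ScraperCheckError` and the `strict` flag are hypothetical names, not existing scraperlib features. The check runs automatically when the wrapped block exits, so no explicit final call is needed, and `strict` decides between failing and only warning.

```python
import logging
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator

logger = logging.getLogger("scraper-check")


class ScraperCheckError(RuntimeError):
    """Hypothetical error raised in strict mode when a check fails."""


@contextmanager
def checked_run(tmp_dir: Path, strict: bool = False) -> Iterator[None]:
    """Run the scraper body, then verify tmp_dir holds no leftover files."""
    # If the body itself raises, that exception propagates and the check is skipped.
    yield
    leftovers = sorted(p for p in tmp_dir.rglob("*") if p.is_file())
    if not leftovers:
        return
    message = f"{len(leftovers)} temporary file(s) left in {tmp_dir}"
    if strict:
        raise ScraperCheckError(message)
    logger.warning(message)


def demo(strict: bool) -> str:
    # The "scraper" here leaves one file behind on purpose.
    with tempfile.TemporaryDirectory() as tmp:
        try:
            with checked_run(Path(tmp), strict=strict):
                (Path(tmp) / "leftover.png").write_bytes(b"fake")
        except ScraperCheckError:
            return "failed"
        return "warned"
```

This only checks at the end of the run, so it does not answer the "fail early" question. An early variant would need a hook at the moment each item is added to the ZIM, and it does not cover scrapers that scatter temporary files across several locations.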

benoit74 • Feb 05 '24 09:02