
Fail the scraper run when RDF fails to download

Open benoit74 opened this issue 1 month ago • 0 comments

Currently, when an RDF file fails to download, the scraper just logs a warning and moves on to the next file.

In the current situation, all RDF files are expected to be present, so I assume it would be safe to fail the scrape when an RDF file fails to download, since this indicates a probable upstream issue.

At least, the last full successful run of gutenberg_mul_all shows no issue when downloading RDF files. Some books were ignored because they are not available in any format, but this is expected: these are audio books.

The problem with ignoring RDF download errors is what happened in https://farm.openzim.org/pipeline/7023df29-a93b-4148-bc3b-d708ae6be2bd : the mirror went offline for 16 minutes and the scraper continued. Had I not been notified by the mirror owner, I would not have cancelled the run, and we would have ended up with a ZIM with some books missing (but probably still consistent).

The scraper should hence stop immediately when an RDF file fails to download (after retrying with backoff, so that small intermittent networking issues are ignored).
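
To make the proposal concrete, here is a minimal sketch of the intended behaviour, not the scraper's actual code. The names `download_rdf`, `fetch_url` and `RDFDownloadError`, as well as the retry count and delays, are hypothetical and only illustrate "retry with exponential backoff, then fail the run":

```python
import time
import urllib.request
from typing import Callable


class RDFDownloadError(RuntimeError):
    """Hypothetical error: an RDF file could not be downloaded after all retries."""


def fetch_url(url: str, timeout: float = 30.0) -> bytes:
    """Download a URL with the standard library and return its body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download_rdf(
    url: str,
    fetch: Callable[[str], bytes] = fetch_url,
    attempts: int = 5,          # assumed value, to be tuned
    base_delay: float = 1.0,    # assumed value, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Try to download an RDF file, backing off between attempts.

    Raises RDFDownloadError once all attempts are exhausted, instead of
    logging a warning and continuing with the next file.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except OSError as exc:  # urllib's URLError and timeouts are OSError subclasses
            if attempt == attempts:
                raise RDFDownloadError(
                    f"giving up on {url} after {attempts} attempts"
                ) from exc
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

The caller would let `RDFDownloadError` propagate so the whole run exits with a failure. With the assumed defaults, the scraper waits up to 15 seconds in total per file before giving up, so a short network blip is absorbed while a 16-minute mirror outage stops the run.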

benoit74 · Nov 25 '25 08:11