Admin tool to bulk reprocess

Open hoyla opened this issue 3 months ago • 1 comments

Some things process much better now than they did previously

Some things we processed, but without adding value we could now add (e.g. a transcription or translation)

For others, we didn't originally grab all the metadata we should have

But we don't know how many things failed or simply processed badly (e.g. PDFs from before OCR and the page viewer)

It'd be good to have a tool to identify files by "failed" status, or by type, collection, workspace, modified date, or any combination thereof, and to batch up reprocessing of the matches.
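As a rough illustration of the selection-and-batching idea, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `FileRecord` fields, the `ReprocessFilter` class and the `batches` helper are hypothetical names, not part of the existing codebase. It only shows how optional criteria could be combined with a logical AND and how the matches could be grouped into fixed-size batches.

```python
from dataclasses import dataclass
from datetime import date
from itertools import islice
from typing import Iterable, Iterator, List, Optional


@dataclass(frozen=True)
class FileRecord:
    # Hypothetical shape of an indexed file; the real fields may differ.
    id: str
    mime_type: str
    collection: str
    workspace: Optional[str]
    modified: date
    failed: bool


@dataclass(frozen=True)
class ReprocessFilter:
    # Every criterion is optional; None means "do not filter on this".
    failed: Optional[bool] = None
    mime_type: Optional[str] = None
    collection: Optional[str] = None
    workspace: Optional[str] = None
    modified_before: Optional[date] = None

    def matches(self, f: FileRecord) -> bool:
        # All supplied criteria must hold (logical AND).
        if self.failed is not None and f.failed != self.failed:
            return False
        if self.mime_type is not None and f.mime_type != self.mime_type:
            return False
        if self.collection is not None and f.collection != self.collection:
            return False
        if self.workspace is not None and f.workspace != self.workspace:
            return False
        if self.modified_before is not None and not f.modified < self.modified_before:
            return False
        return True


def batches(files: Iterable[FileRecord], flt: ReprocessFilter, size: int) -> Iterator[List[str]]:
    # Yield lists of at most `size` ids of files that match the filter.
    if size < 1:
        raise ValueError("size must be at least 1")
    matching = (f.id for f in files if flt.matches(f))
    while True:
        chunk = list(islice(matching, size))
        if not chunk:
            return
        yield chunk
```

For instance, `ReprocessFilter(mime_type="application/pdf", modified_before=date(2020, 11, 1))` would select PDFs modified before November 2020, and each yielded batch could then be handed to whatever reprocessing job is eventually chosen.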

Sep 26 '25 13:09 hoyla

For example:

Bulk re-process 26,082 PDFs uploaded prior to Nov 2020 to make them page-viewable. Use the query from these notes: https://docs.google.com/document/d/19Ehd3cNvxIjH4hwWLUqdIDwdE8-KZpLswi43zZnSpXY/edit#heading=h.sbg5idqowqi

ES index ids and S3 object keys both changed in these:

https://github.com/guardian/pfi/pull/886
https://github.com/guardian/pfi/pull/884

See here for instructions: https://docs.google.com/document/d/19Ehd3cNvxIjH4hwWLUqdIDwdE8-KZpLswi43zZnSpXY/edit#heading=h.sbg5idqowqi

The number to reprocess does not seem unmanageable. These are all the PDFs that have never been attempted with OcrMyPdf:

match (b :Blob)-[:TYPE_OF]->(t :MimeType {mimeType: "application/pdf"})
where not (b)<-[]-(:Extractor {name: "OcrMyPdfExtractor"})
return count(distinct b)

26,082

https://docs.google.com/document/d/19Ehd3cNvxIjH4hwWLUqdIDwdE8-KZpLswi43zZnSpXY/edit#heading=h.sbg5idqowqi
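To get a feel for what batching 26,082 blobs might look like, here is a small Python sketch of the arithmetic. The batch size of 500 and the `batch_ranges` helper are assumptions chosen purely for illustration; nothing in the notes prescribes them.

```python
from typing import List, Tuple


def batch_ranges(total: int, batch_size: int) -> List[Tuple[int, int]]:
    # Return (offset, count) pairs that together cover `total` items.
    if total < 0 or batch_size < 1:
        raise ValueError("total must be >= 0 and batch_size >= 1")
    return [
        (offset, min(batch_size, total - offset))
        for offset in range(0, total, batch_size)
    ]


ranges = batch_ranges(26_082, 500)  # 500 is an assumed batch size
print(len(ranges))   # 53 batches
print(ranges[-1])    # (26000, 82): the final, partial batch
```

One caveat with this approach: because the query above excludes blobs that already have an OcrMyPdfExtractor relationship, the result set would presumably shrink as reprocessing proceeds. Applying these offsets to a live query could therefore skip blobs; it may be safer to snapshot the matching ids once and slice that fixed list.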

Sep 26 '25 13:09 hoyla