Our lovely page viewer should be for more than just PDFs

Open hoyla opened this issue 3 months ago • 1 comments

Word documents (and more) should be ingested in such a way that they can be displayed in Giant's custom page viewer, so people don't get the jarring experience of using different tools to view and search different flavours of text document.

Sep 25 '25 14:09 hoyla

Cf. this comment from a duplicate card:

From Joe:

… from a quick look at the (current) code, it doesn't seem likely that we ever converted Word docs to PDFs at the point of ingestion. (I haven't trawled through the history of the two codebases, though.)

https://github.com/search?q=repo%3Aguardian%2Fgiant%20%22application%2Fmsword%22&type=code

On the extractor (i.e. ingestion) side, the MIME type is only mentioned in the very generic DocumentBodyExtractor, which just tries to grab already-readable text from various file types.

Interestingly though, we do actually convert LibreOffice-compatible docs to PDF at the point of preview: https://github.com/guardian/giant/blob/c0d40e8002426561f1c4bb447223d0e326c4cb9e/backend/app/services/previewing/LibreOfficePreviewGenerator.scala#L24

Given this, it would probably be pretty easy to create a LibreOfficeExtractor that does this conversion at extraction time, then triggers OcrMyPdf on the resulting PDF to get the effect you want.
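
To make the suggestion concrete, here is a rough sketch of the two steps (convert with LibreOffice, then run OCRmyPDF on the result). It is not based on giant's actual extractor interface: the object name `LibreOfficeExtractorSketch`, its methods, the `ocr-` output prefix and the choice of MIME types are all hypothetical, and it assumes `soffice` and `ocrmypdf` are on the PATH. A real extractor would need to plug into giant's existing extractor and ingestion plumbing.

```scala
import java.nio.file.{Files, Path, Paths}
import scala.sys.process._
import scala.util.Try

// Hypothetical sketch only: names and structure are assumptions, not giant's API.
object LibreOfficeExtractorSketch {
  // Assumed set of MIME types this extractor would claim (Word formats only here).
  val supportedMimeTypes: Set[String] = Set(
    "application/msword",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
  )

  def canProcess(mimeType: String): Boolean = supportedMimeTypes.contains(mimeType)

  // Headless LibreOffice conversion; the PDF is written into outDir.
  def convertCommand(input: Path, outDir: Path): Seq[String] =
    Seq("soffice", "--headless", "--convert-to", "pdf", "--outdir", outDir.toString, input.toString)

  // LibreOffice names the output after the input file, with a .pdf extension.
  def convertedPdfPath(input: Path, outDir: Path): Path = {
    val name = input.getFileName.toString
    val dot = name.lastIndexOf('.')
    val stem = if (dot > 0) name.substring(0, dot) else name
    outDir.resolve(stem + ".pdf")
  }

  // --skip-text leaves pages that already have a text layer untouched.
  def ocrCommand(pdf: Path, output: Path): Seq[String] =
    Seq("ocrmypdf", "--skip-text", pdf.toString, output.toString)

  // Runs a command, treating a missing binary or non-zero exit code as failure.
  private def run(cmd: Seq[String]): Boolean = Try(cmd.!).toOption.contains(0)

  // Converts the input to PDF, then OCRs it. Returns the searchable PDF or an error.
  def extract(input: Path, workDir: Path): Either[String, Path] = {
    val pdf = convertedPdfPath(input, workDir)
    val searchable = workDir.resolve("ocr-" + pdf.getFileName.toString)
    if (!run(convertCommand(input, workDir)) || !Files.exists(pdf))
      Left(s"LibreOffice conversion failed for $input")
    else if (!run(ocrCommand(pdf, searchable)))
      Left(s"OCR failed for $pdf")
    else
      Right(searchable)
  }
}
```

The sketch checks that the converted file exists as well as checking the exit code, since a zero exit code alone may not guarantee that a PDF was produced. Whether the OCR step should run inline or be queued as a separate extraction is a design choice left open here.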

Sep 25 '25 17:09 hoyla