Handle cases when LibreOffice hangs
When running Dangerzone against our large test set, we found that some files (e.g., fdo78883.docx and ofz21168-1.doc) make LibreOffice 7.6 hang.
We opened a bug report for these files, but until the underlying issue is solved, we need a way to detect such hangs, and stop the conversion.
Re-introducing timeouts for the whole document is a solution I'd personally like to avoid. They have bitten us a lot in the past (#749), they are arbitrary (documents with many pages need very large timeouts), and we have recently decided to ditch them altogether (#687).
What makes more sense to me is the following:
- Get the number of pages in the document. Here's a way to do so: https://askubuntu.com/questions/305633/how-can-i-determine-the-page-count-of-odt-doc-docx-and-other-office-documents
- Spin up an UNO server (https://github.com/unoconv/unoserver). This server is responsible for loading a document, and listening for API requests.
- Send API requests to the UNO server, and ask to convert the document a single page at a time.
  - See some supported export parameters that LibreOffice offers by default, and the blog post by the LibreOffice dev that added those. The `PageRange` option is of interest to us.
  - The UNO server project provides a command-line client (`unoconvert`), but maybe we can send these API requests programmatically.
- Set a timeout for each API request. Since at the API request level we know we're dealing with a single document page, we can set a sensible timeout (e.g., 3 minutes). Anecdotally, converting a .docx of ~2000 pages took 18 minutes on my laptop, so this timeout is more than reasonable.
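
To make the steps above a bit more concrete, here is a rough sketch of the per-page loop. It is not a tested implementation: it shells out to the `unoconvert` client instead of talking to the server programmatically, and the helper names (`unoconvert_cmd`, `convert_pages`, `PageConversionTimeout`), the output file layout, and the exact flag spelling (`--filter-options` with `PageRange=N-N`) are all assumptions that should be checked against `unoconvert --help` for the installed unoserver version. The page count is assumed to be known already.

```python
import subprocess
import sys
from typing import Callable, Iterator, List

PAGE_TIMEOUT_S = 180.0  # 3 minutes per page, as suggested above


class PageConversionTimeout(Exception):
    """Raised when a single page does not convert within the timeout."""

    def __init__(self, page: int, timeout_s: float) -> None:
        super().__init__(f"page {page} did not convert within {timeout_s}s")
        self.page = page


def unoconvert_cmd(infile: str, outdir: str, page: int) -> List[str]:
    """Build a hypothetical unoconvert command for one page.

    ASSUMPTION: the flag names and the PageRange=N-N syntax are
    illustrative; verify them against the installed unoserver version.
    """
    return [
        "unoconvert",
        "--convert-to", "pdf",
        "--filter-options", f"PageRange={page}-{page}",
        infile,
        f"{outdir}/page-{page}.pdf",
    ]


def convert_pages(
    num_pages: int,
    build_cmd: Callable[[int], List[str]],
    timeout_s: float = PAGE_TIMEOUT_S,
) -> Iterator[int]:
    """Convert pages one at a time, yielding each page number once done.

    The timeout applies to each page separately, so it does not grow
    with the size of the document.
    """
    for page in range(1, num_pages + 1):
        try:
            # subprocess.run() kills the child process if the timeout expires.
            subprocess.run(build_cmd(page), check=True, timeout=timeout_s)
        except subprocess.TimeoutExpired as exc:
            raise PageConversionTimeout(page, timeout_s) from exc
        yield page


if __name__ == "__main__":
    # Stand-in command so the loop can be tried without LibreOffice.
    for done in convert_pages(2, lambda p: [sys.executable, "-c", "pass"]):
        print(f"converted page {done}")
```

With a real document, the call would look something like `convert_pages(n, lambda p: unoconvert_cmd("doc.docx", "/tmp/out", p))`. Note that the timeout here only kills the client process; if LibreOffice itself is hung, the UNO server most likely needs to be stopped or restarted as well.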
Some extra benefits of this approach:
- UNO server has an option to send files back as binary data, instead of writing them to the filesystem. This will help with #633.
- We aim to introduce file previewing in Dangerzone (see #758 for a PoC). One concern we have with file previewing is that LibreOffice documents may take a while to be converted to PDFs, so that we can stream their pixels afterwards. With this method, we can start streaming from the very first page.
- Compared with providing `PageRange` arguments via the LibreOffice CLI, UNO server loads the document in memory once, so it offers faster conversion times for documents with lots of pages.
(Leaving unmilestoned for now given lower potential impact)
Adding ofz21385-1.doc to the list of naughty documents. I haven't opened a bug report for this yet, since I haven't tested if it works on the newest LibreOffice version.