dangerzone Perform on-host conversion for the pixels to PDF stage

This PR introduces a fundamental change in the way Dangerzone processes documents. Instead of first grabbing all of the pixel data from the first container, storing it on disk, and then reconstructing the PDF in a second container, Dangerzone now reconstructs the PDF on the host immediately, while the doc to pixels conversion is still running in the first container. The sanitization is no less safe, since the boundaries between the sandbox and the host are still respected.

What we gain is that we no longer use mounts, and we have much faster conversions, especially on Windows and macOS.
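To illustrate the streaming idea described above, here is a minimal sketch of a host-side reader that consumes pages as soon as the sandbox emits them, rather than waiting for the whole document. The framing used here (a 2-byte big-endian page count, then a 2-byte width, a 2-byte height and raw RGB bytes for each page) and the names `read_exact` and `iter_pages` are assumptions made for illustration, not necessarily what this PR implements.

```python
import io
import struct
from typing import BinaryIO, Iterator, Tuple


def read_exact(stream: BinaryIO, size: int) -> bytes:
    """Read exactly `size` bytes, or raise EOFError if the stream ends early."""
    chunks = []
    remaining = size
    while remaining > 0:
        chunk = stream.read(remaining)
        if not chunk:
            raise EOFError(f"stream ended with {remaining} bytes still expected")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def iter_pages(stream: BinaryIO) -> Iterator[Tuple[int, int, bytes]]:
    """Yield (width, height, rgb_bytes) for each page as soon as it arrives.

    Assumed (hypothetical) framing: page count, then width, height and
    width * height * 3 bytes of RGB data for every page.
    """
    (page_count,) = struct.unpack(">H", read_exact(stream, 2))
    for _ in range(page_count):
        width, height = struct.unpack(">HH", read_exact(stream, 4))
        yield width, height, read_exact(stream, width * height * 3)


# Simulate the sandbox output with two tiny pages (2x1 and 1x1 pixels).
fake_stdout = io.BytesIO(
    struct.pack(">H", 2)
    + struct.pack(">HH", 2, 1) + bytes([255, 0, 0, 0, 255, 0])
    + struct.pack(">HH", 1, 1) + bytes([0, 0, 255])
)

pages = []
for width, height, rgb in iter_pages(fake_stdout):
    # A real host-side conversion would append the page to the output PDF here.
    pages.append((width, height, len(rgb)))
```

Because `iter_pages` is a generator, each page can be handled while later pages are still being produced, and nothing needs to be written to a shared mount. Since the stream comes from the sandbox, a real reader would also want to bound the page count and dimensions before allocating memory.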

Fixes #625

[!NOTE] This PR still has some rough edges. Off the top of my head, we need to:

[ ] Test the changes across all of our supported platforms, and fix all of our CI errors.

[ ] Remove tool.poetry.group.container.dependencies section from pyproject.toml, as it's duplicated info.

[x] Remove --userns keep-id option in Podman.

[x] Make download-tessdata.py cacheable in our CI runs.

[ ] Turn OCR language deps into recommendations on Linux systems, and handle the case where some of them are not installed.

[ ] Improve our Dummy isolation provider, so that the steps that run in the host actually run in our Windows / macOS CI runners.

[ ] Update our packaging logic so that we don't include share/tessdata in our .debs / .rpms.

[ ] Update our wording in various places, so that we no longer refer to using two containers for the sanitization.

[ ] Draft an ARCHITECTURE.md, which will be the source of truth on how Dangerzone works now.

Not all of these can be tackled in a single PR, but before merging this one we at least need to have issues for the ones we won't tackle immediately.

Mar 14 '24 11:03 apyrgio

I'll reply to some of your observations as well:

the GUI code is crashing on me. I think the latest PySide6 version on pypi is broken.

In my Fedora 39 dev environment, the GUI seems to work. Can you provide the error log?

ubuntu focal - how do we solve lack of support? PyMuPDF in a virtualenv?

I was thinking of either reusing PyMuPDF within the container, or using Tesseract just for Ubuntu Focal. I'll let you know.

dummy can have pixels_to_pdf removed

Yeap, you're right.

Progress text improvements: because the conversion to PDF is now native and so fast, maybe we could replace the two log lines with one saying "converting page X". And when OCR is used we could say "making page X searchable".

Yeap, you're right.
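As a rough sketch of that suggestion, a single per-page message could be chosen based on whether OCR is enabled. The function name `progress_message` and the exact wording are hypothetical, not taken from the Dangerzone code:

```python
def progress_message(page: int, total: int, ocr: bool) -> str:
    """Return one progress line per page instead of two."""
    if ocr:
        # With OCR, the page also gets a searchable text layer.
        return f"Making page {page}/{total} searchable"
    return f"Converting page {page}/{total}"


messages = [progress_message(n, 2, ocr=True) for n in (1, 2)]
```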

Mar 27 '24 15:03 apyrgio

Update our packaging logic so that we don't include share/tessdata in our .debs / .rpms.

I worked on this. The code is in the branch 625-host-stream-tessdata-packaging. A lot of stuff had to be moved, and I didn't manage to finish testing this week. I tested on Fedora and Debian, and it seems to be building fine. The only issue is that it includes the .gitkeep in share/container.

On macOS it seems to be failing, but I haven't had time to investigate. If you have the chance before me, feel free to continue where I left off, @apyrgio.

Mar 28 '24 17:03 deeplow

The PR is ready for review once more. The commit messages may require a bit more :heart: and make lint complains, but other than that, it's as ready and tested as it can be.

Oct 08 '24 18:10 apyrgio