Safe link export

Open nicpaesk opened this issue 2 months ago • 1 comments

What is the feature you think should be a good addition to Dangerzone?

Link export: Parse the untrusted PDF inside the sandbox and export only safe link metadata (page index, rectangle, and uri for /URI actions) to a sidecar JSON.

Is your feature request related to a problem? Please describe.

Additional context

Dangerzone’s pixel rasterization deliberately removes interactivity (including links). That’s correct for security. However, we also need a safe way to keep link targets around without weakening the threat model.

This feature would also open the way for a follow-up feature for safe "re-linking" when desired, such as an optional link index page at the end of the PDF.

Implementation sketch:

CLI flag (default off): dangerzone-cli input.pdf --export-links <path/to/links.json>
GUI: add an unchecked option “Export link list (JSON)”.
Optionally (via an advanced flag), allow exporting internal destinations. Keep them in a separate array ("internal_links") with dest data, never mixed with uri links.
JSON schema:

    {
      "dangerzone_version": "X.Y.Z",
      "tool_versions": { "parser": "pymupdf <ver>" },
      "source_sha256": "<hex>",
      "source_filesize": 123456,
      "page_count": 42,
      "pages": [
        {
          "index": 0,
          "width_pt": 612.0,
          "height_pt": 792.0,
          "rotation_deg": 0,
          "links": [
            {
              "rect_pt": [x0, y0, x1, y1],
              "uri": "https://example.org/path",
              "scheme": "https"
            }
          ]
        }
      ],
      "stats": {
        "uri_links": 12,
        "skipped_non_uri_actions": 3,
        "skipped_by_scheme": { "javascript": 2, "file": 1 }
      }
    }
Parsing approach: use PyMuPDF inside the sandbox and iterate over get_links() (or equivalent) for each page.
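
To make the filtering step concrete, here is a rough sketch of how one page's raw link records could be split into exported "links" entries and the "stats" counters from the schema above. It uses only the standard library. The input record shape (a dict with "rect" and an optional "uri"), the function name export_page_links, the "(none)" and "(invalid)" bucket names, and the scheme allow-list are all assumptions made for illustration; the real parser output would need to be mapped onto this shape first.

```python
from collections import Counter
from urllib.parse import urlsplit

# Assumption: this allow-list is illustrative; the proposal does not fix one.
ALLOWED_SCHEMES = frozenset({"http", "https", "mailto"})


def export_page_links(raw_links, allowed=ALLOWED_SCHEMES):
    """Return (links, stats) for one page.

    Each raw record is a hypothetical dict holding a "rect" of four numbers
    and, only for /URI actions, a "uri" string.
    """
    links = []
    skipped_non_uri = 0
    skipped_by_scheme = Counter()

    for record in raw_links:
        uri = record.get("uri")
        if not isinstance(uri, str) or not uri:
            # Not a /URI action (e.g. JavaScript action, launch, internal jump).
            skipped_non_uri += 1
            continue
        try:
            scheme = urlsplit(uri).scheme.lower()
        except ValueError:
            # urlsplit rejects some malformed URLs; never export those.
            skipped_by_scheme["(invalid)"] += 1
            continue
        if scheme not in allowed:
            skipped_by_scheme[scheme or "(none)"] += 1
            continue
        links.append(
            {
                "rect_pt": [float(v) for v in record["rect"]],
                "uri": uri,
                "scheme": scheme,
            }
        )

    stats = {
        "uri_links": len(links),
        "skipped_non_uri_actions": skipped_non_uri,
        "skipped_by_scheme": dict(skipped_by_scheme),
    }
    return links, stats
```

Using an allow-list rather than a deny-list means an unknown scheme is skipped and counted instead of being exported by default.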

Nov 04 '25 11:11 nicpaesk

That's an interesting idea, and I think we need something similar for file metadata, which we currently don't keep even though it is very important to journalists. This also ties in with something else we discussed in the past: https://github.com/freedomofpress/dangerzone/issues/763. Basically, if we were to introduce a mechanism like this, I would suggest making it broader in nature, so that we can support more use cases.

All in all, I'd personally like an option that lets users work on the original document in a safe manner. Maybe now that we have a container image (see https://github.com/freedomofpress/dangerzone/pkgs/container/dangerzone%2Fv1/560273008?tag=latest), we can empower users to do something like:

Create a Dockerfile:

FROM ghcr.io/freedomofpress/dangerzone/v1:latest

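# A program that accepts the document as input and prints something to stdout.
# The destination path is only an example; Dockerfile comments need their own line.
COPY <myprog> /usr/local/bin/myscript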

Build a container image based on ours that includes any code they need. Then, run the user's script in our gVisor-powered container image and get back the output:

podman build -t custom-dz .
cat suspicious.pdf | podman run <dangerzone options> custom-dz myscript > output.json
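
For illustration only, the program copied into the image could be as small as the sketch below: it reads the untrusted document from stdin and prints a JSON summary to stdout, matching the pipeline above. The function names and the choice of output fields (borrowed from the "source_sha256" and "source_filesize" keys in the proposed schema) are assumptions, not an agreed interface.

```python
import hashlib
import json
import sys


def describe(data: bytes) -> dict:
    """Summarize the raw document bytes without interpreting them."""
    return {
        "source_sha256": hashlib.sha256(data).hexdigest(),
        "source_filesize": len(data),
    }


def main(stream_in=None, stream_out=None) -> None:
    # Default to the process's binary stdin and text stdout.
    stream_in = stream_in if stream_in is not None else sys.stdin.buffer
    stream_out = stream_out if stream_out is not None else sys.stdout
    data = stream_in.read()
    json.dump(describe(data), stream_out)
    stream_out.write("\n")


if __name__ == "__main__":
    main()
```

Because the only output channel is stdout, the host side can treat the result as untrusted text and validate it as JSON before using it.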

The trick of course is to offer the above in a manner that is safe from misuse, cross-platform, and document-agnostic.

Nov 05 '25 09:11 apyrgio