
Evaluate Dangerzone's Potential as a Redaction Tool (and add redaction capabilities)

Open deeplow opened this issue 3 months ago • 3 comments

Dangerzone's goal is to protect the user against malware. However, through the way it works, it also removes metadata, so it can help with publication security as well.

The problem

Typical PDF manipulation tools have poorly implemented redaction methods that can be reversed. Because Dangerzone already rasterizes documents, it has nothing to lose by adding redaction. When a black box is drawn over content and the page is then rasterized, the covered content no longer exists in the final output.

This is best put in the paper Story Beyond the Eye: Glyph Positions Break PDF Text Redaction (emphasis added):

Rasterization appears to be an effective defense against deredaction. In many cases this defense is infeasible because it removes searchable text data from the document, however, performing OCR on the document post-redaction can act as a stop-gap for this issue. Rasterization algorithms may also modify or ignore certain glyph shifts, requiring the analyst to perform more reverse engineering to identify the specific rasterization tool used.

We're working on turning Dangerzone into a file viewer, and that could be the perfect chance to add redaction tools.

User Story

As a journalist, I'd like to use Dangerzone to help redact documents, ensuring that redactions cannot be reversed.

How could this work?

User journey:

  1. In view mode, the user draws black boxes over the areas to be redacted
  2. After all redactions are done, the user saves the final document

Technical explanation: the host receives all the rasterized page images. Each time the user adds a black box in the viewer, the host paints that box onto the corresponding page image with an image manipulation module (like Pillow), so the boxes end up in the final images. If we want extra rasterization assurances, we can convert the final PDF through Dangerzone one more time to ensure proper rasterization.
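To make the idea concrete, here is a minimal sketch of the box-painting step using Pillow. The function name `redact_page` and the `(left, top, right, bottom)` box format are assumptions for illustration only; this is not Dangerzone's actual code.

```python
from PIL import Image, ImageDraw


def redact_page(page, boxes, fill=(0, 0, 0)):
    """Return a copy of `page` with every box painted over in solid `fill`.

    `boxes` is an iterable of (left, top, right, bottom) pixel coordinates.
    Hypothetical helper for illustration, not part of Dangerzone.
    """
    # convert() returns a new image, so the original page is left untouched.
    # Converting to RGB also drops any alpha channel, so the box is fully opaque.
    redacted = page.convert("RGB")
    draw = ImageDraw.Draw(redacted)
    for left, top, right, bottom in boxes:
        # Overwrite the pixels themselves; nothing is layered on top.
        draw.rectangle([left, top, right, bottom], fill=fill)
    return redacted


if __name__ == "__main__":
    # Stand-in for a rasterized page received by the host.
    page = Image.new("RGB", (100, 50), "white")
    redacted = redact_page(page, [(10, 10, 40, 20)])
    print(redacted.getpixel((20, 15)))
```

The point of working on pixels is that the box replaces the underlying data instead of covering it, which is what makes the redaction irreversible in the output image.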

Implementation Risks and Unmitigated Risks

We should keep in mind that redaction alone may not be enough to eliminate all unredaction risks. The best advice is to never publish source documents and, if needed, to retype them. I can think of several other ways that redaction could still be bypassed:

  • invisible watermarks: if the purpose is to identify the leaker, then printer dots, space-width variations, etc. could all be used. No redaction can protect against this form of identification. Only retyping the document can potentially help there.
  • character width can be used to reverse redactions (related paper)
  • compression artifacts can leave traces of what was hidden. In pre-compressed artifacts like images we cannot help much, as the whole element has to be redacted. However, dangerzone also compresses documents. We could make sure to only do this in the final rasterization (i.e. the one with the redaction boxes).
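The ordering concern in the last point can be illustrated with a small sketch, assuming Pillow and JPEG as the lossy format (both are assumptions here, and `redact_then_compress` is a hypothetical name). If the box is painted onto uncompressed pixels and lossy compression happens only afterwards, two pages that differ only under the box produce byte-identical output, so the compressed file cannot carry traces of what was hidden.

```python
import io

from PIL import Image, ImageDraw


def redact_then_compress(page, box, quality=75):
    """Paint `box` black on the raw pixels, then apply lossy compression once.

    Hypothetical helper for illustration, not part of Dangerzone.
    """
    redacted = page.convert("RGB")  # work on a copy of the uncompressed pixels
    ImageDraw.Draw(redacted).rectangle(box, fill=(0, 0, 0))
    buf = io.BytesIO()
    # Lossy compression happens only here, after the content is already gone.
    redacted.save(buf, format="JPEG", quality=quality)
    return buf.getvalue()


if __name__ == "__main__":
    box = (10, 10, 50, 40)
    # Two pages that differ only inside the area to be redacted.
    page_a = Image.new("RGB", (64, 64), "white")
    page_b = page_a.copy()
    ImageDraw.Draw(page_a).rectangle((20, 20, 30, 30), fill=(200, 0, 0))
    ImageDraw.Draw(page_b).rectangle((22, 18, 35, 28), fill=(0, 0, 200))
    print(redact_then_compress(page_a, box) == redact_then_compress(page_b, box))
```

This says nothing about elements that were already lossily compressed before they reached Dangerzone; as noted above, those would need to be redacted as a whole.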

deeplow, Apr 02 '24 20:04