pdf-redact-tools Suggestions

Open HulaHoopWhonix opened this issue 9 years ago • 4 comments

Hi Micah. I've been thinking about ways to sanitize media of all types so that high-risk groups like journalists can handle untrusted files that may carry malicious payloads.

Some of my suggestions are only relevant in the form of separate projects of their own to keep in line with the philosophy of making a tool do one thing and do it well. But they are all similar in scope and the way they work and can all be tied into a suite.

pdf-redact-tools:

I like how it's platform-agnostic and can be used by Whonix users across all supported hypervisors, and on TAILS/Linux bare metal for air-gapped setups, unlike the similar Qubes-only feature.

The design choices behind Qubes pdf-convert differ on some security and feature points. The Qubes blog post (1) is a recommended read: .png is further converted to .rgb to strip out any complex data or headers that could contain malicious data compared to other formats.

The process:

Untrusted vm/ pc [untrusted pdf > conversion to png > simple rgb format] > Transfer to trusted vm/ air-gapped pc [simple rgb > reassemble back to png > conversion back to trusted pdf]
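To make the "reassemble back to png" step on the trusted side more concrete, here is a minimal sketch in Python using only the standard library. It is not the Qubes converter's code; the function names are hypothetical, and it assumes 8-bit RGB pixels with the width and height passed in separately, since the raw data carries no header. The length check rejects any blob that does not match the declared dimensions exactly.

```python
import struct
import zlib


def _chunk(kind: bytes, data: bytes) -> bytes:
    """Build one PNG chunk: length, type, data, CRC over type + data."""
    crc = zlib.crc32(kind + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + kind + data + struct.pack(">I", crc)


def rgb_to_png(rgb: bytes, width: int, height: int) -> bytes:
    """Wrap headerless 8-bit RGB pixels in a minimal PNG container.

    Hypothetical helper: width and height come from the caller, not
    from the untrusted data.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    expected = width * height * 3
    if len(rgb) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(rgb)}")

    stride = width * 3
    # Each scanline is prefixed with filter type 0 (no filtering).
    rows = b"".join(
        b"\x00" + rgb[y * stride:(y + 1) * stride] for y in range(height)
    )
    # IHDR: width, height, bit depth 8, colour type 2 (RGB), then the
    # default compression, filter and interlace methods.
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    return (
        b"\x89PNG\r\n\x1a\n"
        + _chunk(b"IHDR", ihdr)
        + _chunk(b"IDAT", zlib.compress(rows))
        + _chunk(b"IEND", b"")
    )
```

A call such as `rgb_to_png(pixels, 1240, 1754)` would then produce the bytes of one page image, which a separate tool can assemble into the final PDF.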

The scripts (2) support image files too. It's a simpler variation of the PDF conversion process because it's only the png (or supported image format) to rgb step.

Improvements over existing tools:

A shortcoming of the process outlined in (1) is the loss of PDF searchability. I am not sure if it's in scope for pdf-redact-tools, but pipelining the non-searchable PDF output into hOCR (3) can solve it. This step should be optional, though, because OCR software is not always accurate.

Untrusted vm/ pc [untrusted pdf > conversion to png > simple rgb format] > Transfer to trusted vm/air-gapped pc[simple rgb > reassemble back to png > optional ocr operation using tools (3 & 4) and cleanup > trusted but not searchable PDF > OCR software > hOCR2pdf > trusted and searchable PDF]

Sources (3 & 4) have some bash code and a respectable list of libre OCR and pdf manipulation gear.

Konrad Voelkel's link recommended scantailor and unpaper for enhancing scanned images to make them OCR friendly.
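The optional OCR step could be driven roughly as follows. This is only a sketch: it assumes Tesseract as the OCR engine (the thread does not name one) and `hocr2pdf` from ExactImage, both installed and on the PATH. The function names are hypothetical, and older Tesseract releases may write the hOCR output with an `.html` extension instead of `.hocr`.

```python
import subprocess
from pathlib import Path


def ocr_commands(page_png: Path, out_pdf: Path):
    """Return (tesseract argv, hocr2pdf argv, expected hOCR path).

    Assumes `tesseract IMAGE OUTPUTBASE hocr` writes OUTPUTBASE.hocr and
    that hocr2pdf reads the hOCR document from standard input.
    """
    base = page_png.with_suffix("")
    hocr_path = Path(str(base) + ".hocr")
    tesseract_cmd = ["tesseract", str(page_png), str(base), "hocr"]
    hocr2pdf_cmd = ["hocr2pdf", "-i", str(page_png), "-o", str(out_pdf)]
    return tesseract_cmd, hocr2pdf_cmd, hocr_path


def make_searchable(page_png: Path, out_pdf: Path) -> None:
    """Run OCR on one sanitized page and emit a searchable PDF."""
    tesseract_cmd, hocr2pdf_cmd, hocr_path = ocr_commands(page_png, out_pdf)
    subprocess.run(tesseract_cmd, check=True)
    with open(hocr_path, "rb") as hocr:
        subprocess.run(hocr2pdf_cmd, stdin=hocr, check=True)
```

Keeping the command construction separate from execution makes it easy to skip the OCR step entirely, which matches the suggestion that it stay optional.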

Further down the list, but still important, is adding a GUI to make it more accessible to journalists.

Beyond Documents and Images: Video

With the above enhancements, sanitization and redaction of sensitive documents and images is feasible. Even documents not originally in PDF can be easily converted to the format using a variety of programs.

A similar process can be devised for video, but it makes sense to be implemented separately.

Untrusted vm/ pc [untrusted video > convert to raw video format YUV with libav; handle audio the same way] > Transfer to trusted vm/ air-gapped pc [simple YUV > libav combine raw audio and convert back to original video format > Trusted video]

Untrusted vm/ pc [untrusted video > extract and convert audio with libav] > Transfer to trusted vm/ air-gapped pc [simple .raw audio format > libav accept parameters of original audio and combine with raw video to convert back to original video format > Trusted Video]
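One check the trusted side can run before handing raw video to libav is that the file size matches the declared resolution. The sketch below assumes planar YUV 4:2:0 with 8-bit samples (commonly called yuv420p); the thread only says "YUV", so the pixel format and the function names here are assumptions.

```python
def yuv420p_frame_size(width: int, height: int) -> int:
    """Bytes in one planar YUV 4:2:0 frame with 8-bit samples."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    # Chroma planes are subsampled by two in each direction, rounded up.
    chroma_w = (width + 1) // 2
    chroma_h = (height + 1) // 2
    return width * height + 2 * chroma_w * chroma_h


def count_frames(total_bytes: int, width: int, height: int) -> int:
    """Return the frame count, or raise if the size does not fit exactly."""
    frame = yuv420p_frame_size(width, height)
    if total_bytes % frame != 0:
        raise ValueError(
            f"{total_bytes} bytes is not a whole number of "
            f"{width}x{height} frames ({frame} bytes each)"
        )
    return total_bytes // frame
```

For a 640x480 stream each frame is 460800 bytes, so a file whose size is not a multiple of that was either produced with different parameters or is not what it claims to be.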

1 http://theinvisiblethings.blogspot.com/2013/02/converting-untrusted-pdfs-into-trusted.html

2 https://github.com/QubesOS/qubes-app-linux-pdf-converter

3 http://www.konradvoelkel.com/2010/01/linux-ocr-and-pdf-problem-solved/

4 http://www.konradvoelkel.com/2013/03/scan-to-pdfa/

5 https://wiki.libav.org/Snippets/avconv

6 https://stackoverflow.com/questions/5194285/yuv-file-format

7 https://en.wikipedia.org/wiki/Raw_audio_format

8 https://unix.stackexchange.com/questions/25875/what-and-how-is-the-encoding-of-a-raw-headerless-audio-file

9 https://stackoverflow.com/questions/2059014/converting-raw-audio-data-to-wav-with-scripting

May 07 '15 18:05 HulaHoopWhonix

Edit: Added to original post how to handle audio.

For all types of files, the idea is to convert the original data into a simple headerless format to strip out any malformed data that can be abused to trigger a bug in the parsing program, whether it's a PDF reader, an image viewer or a video player. The downside is that these simple data files don't contain any information about the original file or how it was structured, so they are not usable on their own. To get around this, the conversion script should analyze the original data and notify the user of things like pixel resolution, framerate or audio bitrate. When reconverting to the trusted form, the script should prompt the user to supply that information so the process is done properly, and to prevent the converting program from being tricked into thinking the raw data is some other format.
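For audio, the reconversion step with user-supplied parameters might look like the following sketch, which uses Python's standard `wave` module. It assumes the raw data is uncompressed PCM and that the user supplies the channel count, sample width in bytes and sample rate; those three parameters and the function name are illustrative choices, not something the thread specifies.

```python
import io
import wave


def raw_pcm_to_wav(raw: bytes, channels: int, sample_width: int,
                   sample_rate: int) -> bytes:
    """Wrap headerless PCM samples in a WAV container.

    All format parameters come from the user, never from the raw data.
    """
    if channels <= 0 or sample_width <= 0 or sample_rate <= 0:
        raise ValueError("channels, sample_width and sample_rate must be positive")
    frame_bytes = channels * sample_width
    if len(raw) % frame_bytes != 0:
        raise ValueError(
            f"{len(raw)} bytes is not a whole number of {frame_bytes}-byte frames"
        )
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(sample_rate)
        wav.writeframes(raw)
    return buf.getvalue()
```

The script could print the same three values on the untrusted side and ask for them again here, so that a mismatch is caught before any player parses the result.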

May 09 '15 01:05 HulaHoopWhonix

Hey, thanks for putting so much thought into this!

The main security issue I have with pdf-redact-tools is that the "sanitizing" doesn't happen in isolation (like in a Qubes disposable vm) but rather on the same computer using imagemagick. So if the PDF is malicious and the malware targets imagemagick, you get hacked. But this is hard to avoid, and I want to ultimately make this usable by journalists on their Macs.

Another issue to consider is final document file size and quality. If you have a 5mb PDF and use pdf-convert in Qubes, you end up with like a 40mb trusted PDF, which is just unwieldy. But the image quality is really good. The image quality isn't always the best with pdf-redact-tools, but it makes the final file sizes not so terrible.

And yeah, OCR would be great. Since I'm aiming to make this cross-platform in Linux and OSX, it would be best if whatever OCR library was used was available in Homebrew for OSX too.

I recently showed pdf-redact-tools to a group of journalists and demoed redacting some documents. I think that the fact that there's no GUI makes it too hard for a lot of people, so that should definitely be added.

And if there's a GUI, maybe it makes sense to build the drawing-black-boxes functionality into that GUI too? That starts to make it sound like a much bigger project though, because often you run into situations where you may want to choose the color of your black boxes, you may want to add text (to describe what you redacted), etc. Sometimes when redacting a table of information you want to put some black boxes on a layer and duplicate the layer several times to redact each row exactly, which is all functionality that would take a lot of work to replicate compared to what's already available in GIMP or Photoshop.

May 12 '15 23:05 micahflee

"And if there's a GUI, maybe it makes sense to build the drawing-black-boxes functionality into that GUI too?"

I think any image editing is better left to GIMP to not complicate things. However pdf-redact-tools could preview each image/page and allow them to be opened into GIMP if the user clicks an "Edit" button in the GUI.

After more research I reached the conclusion that the sanitization concept is flawed and cannot provide the expected guarantees of generating trusted files - this is the case for all designs including isolation based ones. More on that below. So from now on I'll concentrate on the redacting aspect and ways to enhance usability.

pdf-redact-tools could offer to run the final files through the Metadata Anonymization Toolkit to strip out sensitive data about the system the file was edited on and when. Maybe MAT could be made a dependency because it's best practice to use it, especially in the journalist context.

There is no foolproof way to sanitize media: a smart adversary can account for the conversion and make sure their crafted code survives it and still ends up in the saved trusted PNG. Untrusted data should always be treated as such and only handled in a VM. Period.

Very cool research by this guy:

https://www.idontplaydarts.com/2012/06/encoding-web-shells-in-png-idat-chunks/

Some conclusions

Placing shells in IDAT chunks has some big advantages and should bypass most data validation techniques where applications resize or re-encode uploaded images. You can even upload the above payloads as GIFs or JPEGs etc. as long as the final image is saved as a PNG.

There are probably some better techniques you could use to hide the shell more convincingly and short of scanning each uploaded image for a shell there is probably not much you can do as a developer to stop it. I'd imagine that encoding a shell into a lossy format such as JPEG could be substantially harder - but probably not impossible.

May 15 '15 15:05 HulaHoopWhonix

Hello! Just a tiny note, maybe you want to add it to "Building PDF Redact Tools": On CentOS 7 (the only OS I tried out of curiosity), you need to install the EPEL repository (sudo yum install epel-release) before you can install perl-Image-ExifTool. I'm rather new to the RedHat-side of Linux, so this might seem obvious but it took me quite some ~~googling~~ startpaging to find out.

May 29 '15 08:05 loneum
