Visualize Temporal Taint Patterns

Open ESultanik opened this issue 4 years ago • 5 comments

Now that we maintain temporal information for when specific bytes are operated on, it would be interesting (although perhaps not useful) to visualize it as an animated GIF.

  1. Represent the input file as an image where each pixel in the image represents an associated byte in the input file
  2. Allow the user to specify an output image height, width, or aspect ratio on the command line
  3. If the user does not specify a height, width, or aspect ratio, default to an aspect ratio of 1.618
      import math

      input_file_bytes: int = ...
      aspect_ratio: float = 1.618  # overridable via command line
      sqrt_ratio = math.sqrt(aspect_ratio)
      sqrt_bytes = math.sqrt(input_file_bytes)
      # As written, aspect_ratio is treated as height / width, so ratios above 1 give a tall image
      output_image_height: int = max(int(math.ceil(sqrt_bytes * sqrt_ratio)), 1)
      output_image_width: int = max(int(math.ceil(sqrt_bytes / sqrt_ratio)), 1)
  4. Here is an example of how to generate an animated GIF from Python using Pillow.
  5. Each time a byte of the input file is operated on, the associated pixel should be highlighted. We could then have a cooldown function that gradually fades out that pixel over a certain number of frames.
  6. If we also have temporal information regarding the context of how the bytes are used (e.g., whether or not they affect control flow), then we can color pixels differently based upon that.
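
The highlight, cooldown, and coloring steps above could be sketched roughly as follows. This is only an illustration under stated assumptions, not PolyTracker's implementation: the `(offset, affects_control_flow)` event shape, the grouping of events into frames, and the names `render_frames` and `save_gif` are all hypothetical. Frames are computed in plain Python, and Pillow is only needed for the final GIF write.

```python
from typing import Iterable, List, Sequence, Tuple

# Hypothetical event record: (byte offset, does this use affect control flow?)
Event = Tuple[int, bool]
Pixel = Tuple[int, int, int]


def render_frames(
    num_bytes: int,
    width: int,
    events_per_frame: Sequence[Iterable[Event]],
    cooldown_frames: int = 8,
) -> List[List[Pixel]]:
    """Return one flat, row-major list of RGB pixels per frame."""
    height = max(-(-num_bytes // width), 1)  # ceiling division
    heat = [0.0] * (width * height)          # 1.0 = just touched, fades toward 0.0
    control = [False] * (width * height)     # last known context of each byte
    step = 1.0 / cooldown_frames
    frames: List[List[Pixel]] = []
    for events in events_per_frame:
        # Cool every pixel down first, so bytes touched in this frame show at full brightness.
        heat = [max(h - step, 0.0) for h in heat]
        for offset, affects_control_flow in events:
            if 0 <= offset < num_bytes:
                heat[offset] = 1.0
                control[offset] = affects_control_flow
        # Red for bytes that affected control flow, green for everything else.
        frames.append([
            (int(255 * h), 0, 0) if cf else (0, int(255 * h), 0)
            for h, cf in zip(heat, control)
        ])
    return frames


def save_gif(frames: List[List[Pixel]], width: int, path: str, ms_per_frame: int = 50) -> None:
    """Write the frames as an animated GIF (requires Pillow)."""
    from PIL import Image  # imported lazily so render_frames works without Pillow

    height = len(frames[0]) // width
    images = []
    for pixels in frames:
        image = Image.new("RGB", (width, height))
        image.putdata(pixels)
        images.append(image)
    images[0].save(path, save_all=True, append_images=images[1:],
                   duration=ms_per_frame, loop=0)
```

A linear fade is used here only because it is the simplest cooldown; an exponential decay or a longer tail would slot into the same loop.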

ESultanik • May 07 '20 17:05

Hi @ESultanik, I’ve been promoting something similar (a “heat map” of file reads) for a while… I see the following benefits:

  • Visualize what a given parser reads/processes vs what it doesn’t (theoretically could then determine missing features)
  • Visualize the basic classification of parser algorithm that is in use – breadth-first vs depth-first, on-demand vs big-bang/load-it-all-in
  • Efficiency – how many times data is read/processed, “stride sizes” for reads
  • Properties of algorithms in use – e.g. if text extraction uses tagged PDF semantics; if color processing uses ICC profiles
  • For malformed files, what "fix-up logic" eventually resolves to (since this is usually undocumented) and whether different parsers agree

Hope that gives you some additional motivation!
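
The coverage and efficiency ideas in the list above could be computed from a log of reads along these lines. This is a sketch only: the `(offset, length)` read record and the function names are assumptions made for illustration, not something PolyTracker is known to emit in this form.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Hypothetical read record: (file offset, number of bytes read), in the order the reads happened
Read = Tuple[int, int]


def read_counts(file_size: int, reads: Iterable[Read]) -> List[int]:
    """Count how many times each byte of the file was read (the heat map)."""
    counts = [0] * file_size
    for offset, length in reads:
        for i in range(max(offset, 0), min(offset + length, file_size)):
            counts[i] += 1
    return counts


def unread_ranges(counts: List[int]) -> List[Tuple[int, int]]:
    """Return half-open [start, end) ranges of bytes that were never read."""
    ranges = []
    start = None
    for i, count in enumerate(counts):
        if count == 0 and start is None:
            start = i
        elif count != 0 and start is not None:
            ranges.append((start, i))
            start = None
    if start is not None:
        ranges.append((start, len(counts)))
    return ranges


def stride_histogram(reads: Iterable[Read]) -> Counter:
    """Histogram of offset differences between consecutive reads."""
    ordered = list(reads)
    return Counter(b[0] - a[0] for a, b in zip(ordered, ordered[1:]))


reads = [(0, 4), (4, 4), (2, 4), (12, 2)]
counts = read_counts(16, reads)
```

Bytes with a count above 1 were read repeatedly, the unread ranges hint at content the parser ignores, and a stride histogram dominated by one positive value suggests a mostly sequential scan.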

petervwyatt • May 08 '20 04:05

Hi Peter, thank you for the feedback.

Could you point me in the direction of a PDF that is malformed in a "fixable" way, and to a parser that will automatically perform this repair?

carsonharmon • May 08 '20 14:05

Sure (thinking of something like no xref) - do you have a 'short-list' of parsers you prefer?

petervwyatt • May 09 '20 00:05

Yes!

Parser short-list: MuPDF, QPDF, and Poppler.

Thank you.

carsonharmon • May 11 '20 11:05

polytracker.zip

I hand-made some samples for you by hex-editing down a PDF: no xref, no startxref, no trailer, etc. The filenames describe each malformation. QPDF certainly outputs different warning messages, so you should be able to capture via PolyTracker the additional recovery mechanisms that fire. MuPDF and Poppler also support some (but not all!) of these malformations. I also provided the baseline PDF (...-original.pdf) to make 'diff-ing' the processing easier for you. Let me know if you want more samples.

petervwyatt • May 26 '20 08:05