Visualize Temporal Taint Patterns
Now that we maintain temporal information for when specific bytes are operated on, it would be interesting (although perhaps not useful) to visualize it as an animated GIF.
- Represent the input file as an image in which each pixel corresponds to one byte of the input file
- Allow the user to specify an output image height, width, or aspect ratio on the command line
- If the user does not specify a height, width, or aspect ratio, default to an aspect ratio of 1.618
import math

input_file_bytes: int = ...  # size of the input file, in bytes
aspect_ratio: float = 1.618  # width / height; overridable via command line
sqrt_ratio = math.sqrt(aspect_ratio)
sqrt_bytes = math.sqrt(input_file_bytes)
# Rounding both dimensions up guarantees width * height >= input_file_bytes
output_image_width: int = max(math.ceil(sqrt_bytes * sqrt_ratio), 1)
output_image_height: int = max(math.ceil(sqrt_bytes / sqrt_ratio), 1)
- Animated GIFs can be generated from Python using Pillow.
- Each time a byte of the input file is operated on, the associated pixel should be highlighted. We could then have a cooldown function that gradually fades out that pixel over a certain number of frames.
- If we also have temporal information regarding the context of how the bytes are used (e.g., if they affect control flow or not), then we can color pixels differently based upon that.
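
The steps above could be sketched roughly as follows. This is only an illustration, not PolyTracker code: `resolve_dimensions`, `render_frames`, and `save_gif` are hypothetical helper names, the `(frame_index, byte_offset)` event format is an assumption about how the temporal taint information might be exported, and the aspect ratio is assumed to mean width divided by height. Frames are grayscale here; coloring pixels by usage context is left out.

```python
import math
from typing import Dict, Iterable, List, Optional, Tuple

from PIL import Image


def resolve_dimensions(num_bytes: int, width: Optional[int] = None,
                       height: Optional[int] = None,
                       aspect_ratio: float = 1.618) -> Tuple[int, int]:
    """Return (width, height). When at most one dimension is given, the
    other is derived so that every input byte gets its own pixel."""
    num_bytes = max(num_bytes, 1)
    if width is None and height is None:
        # Assumption: aspect_ratio = width / height
        width = math.ceil(math.sqrt(num_bytes * aspect_ratio))
    if width is None:
        width = math.ceil(num_bytes / height)
    if height is None:
        height = math.ceil(num_bytes / width)
    return max(width, 1), max(height, 1)


def render_frames(events: Iterable[Tuple[int, int]], width: int, height: int,
                  num_frames: int, cooldown: int = 8) -> List[Image.Image]:
    """Render one grayscale frame per time step.

    `events` holds hypothetical (frame_index, byte_offset) pairs. A touched
    byte is drawn at full brightness and fades linearly to black over
    `cooldown` frames. Byte offsets map to pixels in row-major order.
    """
    touched: Dict[int, List[int]] = {}
    for frame_index, offset in events:
        touched.setdefault(frame_index, []).append(offset)

    heat = [0] * (width * height)  # remaining "warmth" of each pixel
    frames = []
    for frame_index in range(num_frames):
        heat = [max(h - 1, 0) for h in heat]  # cool every pixel by one step
        for offset in touched.get(frame_index, ()):
            if 0 <= offset < len(heat):  # ignore offsets that do not fit
                heat[offset] = cooldown
        pixels = bytes(255 * h // cooldown for h in heat)
        frames.append(Image.frombytes("L", (width, height), pixels))
    return frames


def save_gif(frames: List[Image.Image], path: str, ms_per_frame: int = 50) -> None:
    """Write the frames as a looping animated GIF using Pillow."""
    frames[0].save(path, save_all=True, append_images=frames[1:],
                   duration=ms_per_frame, loop=0)
```

With `n` as the input file size, something like `save_gif(render_frames(events, *resolve_dimensions(n), num_frames=200), "taint.gif")` would write the animation. A linear fade is just the simplest cooldown to show; an exponential decay would be an equally reasonable choice.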
Hi @ESultanik, I’ve been promoting something similar (a “heat map” of file reads) for a while… I see the following benefits:
- Visualize what a given parser reads/processes vs what it doesn’t (theoretically could then determine missing features)
- Visualize the basic classification of parser algorithm that is in use – breadth-first vs depth-first, on-demand vs big-bang/load-it-all-in
- Efficiency – how many times data is read/processed, “stride sizes” for reads
- Properties of algorithms in use – e.g. if text extraction uses tagged PDF semantics; if color processing uses ICC profiles
- For malformed files, what "fix-up logic" eventually resolves to (since this is usually undocumented) and whether different parsers agree

Hope that gives you some additional motivation!
Hi Peter, thank you for the feedback.
Could you point me in the direction of a PDF that is malformed in a "fixable" way, and to a parser that will automatically perform this repair?
Sure (thinking of something like no xref) - do you have a 'short-list' of parsers you prefer?
Yes!
Parser short-list: MuPDF, QPDF, and Poppler.
Thank you.
polytracker.zip

I hand-made some samples for you by hex-editing down a PDF: no xref, no startxref, no trailer, etc. Filenames describe the malformation. QPDF certainly outputs different warning messages, so you should be able to capture via PolyTracker the additional recovery mechanisms that fire. MuPDF and Poppler also support some (but not all!) of these malformations. I also provided the baseline PDF (...-original.pdf) to make 'diff-ing' the processing easier for you. Let me know if you want more samples.