PDF files are not always detected

Open peterekepeter opened this issue 1 year ago • 1 comments

From my testing, the %PDF- marker does not necessarily have to be at offset 0. It can be located anywhere in the file. For example, I can type some junk at the beginning of the file and it still opens.

I received multiple files like this from people, so there is something or someone out in the wild that adds extra characters in front of the magic sequence.

A detector could search for the marker as a substring inside a search window, something like this:

def is_pdf(file_path):
    with open(file_path, "rb") as file:
        # open() and read() may raise OSError (IOError is an alias of it in Python 3)
        header = file.read(1024)
        return b"%PDF-" in header

From what I see currently the library is not built to handle this kind of situation. So I'm leaving this ticket here with this code snippet in case more advanced detection is implemented.

peterekepeter avatar Jul 17 '24 12:07 peterekepeter

Just to make sure, I did check out the PDF specifications themselves:

The PDF file begins with the 5 characters “%PDF-” and byte offsets shall be calculated from the PERCENT SIGN (25h). NOTE 1 This provision allows for arbitrary bytes preceding the %PDF- without impacting the viability of the PDF file and its byte offsets.

So it is valid for a PDF not to start strictly with %PDF-, but it must still contain the marker in its header. Will work on a better way to detect this.
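Since the spec counts byte offsets from the PERCENT SIGN, a detector may want the marker's position rather than just a yes/no answer. A minimal sketch, not puremagic's actual implementation: the function name find_pdf_marker is hypothetical, and the 1024-byte window is an assumption carried over from the snippet above, since the quoted spec text does not state a limit.

```python
import io
from typing import BinaryIO, Optional

PDF_MARKER = b"%PDF-"


def find_pdf_marker(stream: BinaryIO, window: int = 1024) -> Optional[int]:
    """Return the offset of b"%PDF-" in the first `window` bytes, or None.

    Hypothetical helper for illustration. The window size is an assumption,
    and the stream is expected to be positioned at the start of the file.
    """
    head = stream.read(window)
    offset = head.find(PDF_MARKER)  # bytes.find returns -1 when absent
    return offset if offset != -1 else None


# Junk bytes in front of the marker, like the files described in this issue.
sample = io.BytesIO(b"junk\r\n%PDF-1.7\n")
print(find_pdf_marker(sample))  # 6
```

A marker that starts inside the window but is cut off by its end is not reported, so a real implementation might read a few extra bytes past the window.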

cdgriffith avatar Sep 28 '24 22:09 cdgriffith

Should work a lot better in version 2, starting the beta now! https://github.com/cdgriffith/puremagic/releases/tag/2.0.0b1

cdgriffith avatar May 04 '25 22:05 cdgriffith