
Certain PDFs are not recognized as binaries

Open rhliang opened this issue 9 years ago • 4 comments

I noticed that binaryornot does not recognize certain PDFs as binaries. In particular, is_binary returns False for this preprint from BioRxiv. However, it does spot other PDFs that I tried as binaries.

rhliang avatar Jul 05 '16 18:07 rhliang

Interesting: that PDF has '%PDF-1.4\r%' as its first few bytes, so it should be possible to detect it from those.
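As a rough illustration of that idea, here is a minimal sketch of a signature check. The helper names `has_pdf_signature` and `looks_like_pdf` are hypothetical and not part of binaryornot, and the sketch assumes the `%PDF-` header sits at offset 0 of the file.

```python
PDF_MAGIC = b"%PDF-"  # leading bytes of a PDF header such as '%PDF-1.4'


def has_pdf_signature(chunk):
    """Return True if the chunk starts with the PDF header marker."""
    return chunk.startswith(PDF_MAGIC)


def looks_like_pdf(path):
    """Hypothetical helper: read only the leading bytes and test the signature."""
    with open(path, "rb") as f:
        return has_pdf_signature(f.read(len(PDF_MAGIC)))


# The first bytes quoted in this thread match the signature.
print(has_pdf_signature(b"%PDF-1.4\r%"))  # True
```

A check like this would only cover PDFs, so it could complement a general heuristic rather than replace it.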

timClicks avatar Jul 16 '16 22:07 timClicks

I have similar issues with PDF files that obviously contain binary data within them being interpreted as not binary with this library.

Jitsusama avatar Jun 05 '17 14:06 Jitsusama

Any suggestions on how to resolve this one?

pydanny avatar Aug 01 '17 23:08 pydanny

It looks like there is a possible false negative here. For the example PDF, the first chunk is b'%PDF-1.4\r%\xe2\xe3\xcf\xd3\r\n83 0 obj\n<</Metadata 97 0 R/Pages 5 0 R/Type/Catalog>>\nendobj\n97 0 obj\n<</Length 1534/Subtype/XML/Type/Metadata>>stream\r\n<?xpacket begin="\xef\xbb\xbf" id="W5M0MpCehiHzreSzNTczkc9d"?>\n<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 4.2.1-c043 52.389687, 2009/06/02-13:20:35 ">\n <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">\n <rdf:Description rdf:about=""\n xmlns:xmp="http://ns.adobe.com/xap/1.0/">\n <xmp:CreateDate>2016-04-19T10:00:37Z</xmp:CreateDate>\n <xmp:CreatorTool>Word</xmp:CreatorTool>\n <xmp:ModifyDate>2017-08-08T09:14:29-07:00</xmp:ModifyDate>\n <xmp:MetadataDate>2017-08-08T09:14:29-07:00</xmp:MetadataDate>\n </rdf:Description>\n <rdf:Description rdf:about=""\n xmlns:pdf="http://ns.adobe.com/pdf/1.3/">\n <pdf:Keywords/>\n <pdf:Producer>Mac OS X 10.9.4 Quartz PDFContext</pdf:Producer>\n </rdf:Description>\n <rdf:Description rdf:about=""\n xmlns:dc="http://purl.org'

That chunk gets a low ratio of low_chars: low_chars = b'\xe2\xe3\xcf\xd3\xef\xbb\xbf', so nontext_ratio1 = 0.0068359375

It gets a high ratio for high_chars: 0.9931640625

It also gets a low confidence for the detected encoding: detected_encoding = {'encoding': 'ISO-8859-1', 'confidence': 0.506747572815534, 'language': ''}

It also does not trigger this check, since the chunk contains neither byte: if b'\x00' in bytes_to_check or b'\xff' in bytes_to_check:

So is_binary returns False for this file.
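The ratios quoted above are consistent with 7 non-ASCII bytes in a 1024-byte chunk (7/1024 and 1017/1024). Here is a minimal sketch that reproduces them. The byte classes and the helper names `chunk_ratios` and `has_null_or_ff` are assumptions chosen to match the numbers in this thread, not code copied from binaryornot, and the chunk is a synthetic stand-in padded to 1024 bytes rather than the real PDF data.

```python
# Assumed byte classes (chosen to reproduce the numbers in this thread).
CONTROL_CHARS = b"\n\r\t\f\b"
PRINTABLE_ASCII = CONTROL_CHARS + bytes(range(32, 127))
HIGH_BYTES = bytes(range(127, 256))


def chunk_ratios(chunk):
    """Return (low_ratio, high_ratio) for a non-empty chunk of bytes."""
    # Bytes that remain after deleting printable ASCII and common whitespace.
    low_chars = chunk.translate(None, PRINTABLE_ASCII)
    # Bytes that remain after deleting everything in the 0x7f-0xff range.
    high_chars = chunk.translate(None, HIGH_BYTES)
    return len(low_chars) / len(chunk), len(high_chars) / len(chunk)


def has_null_or_ff(chunk):
    """Mirror of the NULL / 0xff check quoted in this thread."""
    return b"\x00" in chunk or b"\xff" in chunk


# Synthetic stand-in: same 7 non-ASCII bytes as the real chunk, padded with 'a'.
head = b"%PDF-1.4\r%\xe2\xe3\xcf\xd3\r\n" + b'<?xpacket begin="\xef\xbb\xbf"?>'
chunk = head.ljust(1024, b"a")

low_ratio, high_ratio = chunk_ratios(chunk)
print(low_ratio)              # 0.0068359375
print(high_ratio)             # 0.9931640625
print(has_null_or_ff(chunk))  # False
```

With so few non-ASCII bytes in the first chunk, a ratio-based heuristic sees mostly text, which fits the False result reported here.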

theodesp avatar Aug 08 '17 16:08 theodesp