binaryornot
Certain PDFs are not recognized as binaries
I noticed that binaryornot does not recognize certain PDFs as binaries. In particular, is_binary returns False for this preprint from BioRxiv. However, it does correctly identify the other PDFs I tried as binaries.
Interesting. That PDF has '%PDF-1.4\r%' as its first few bytes, so it should be possible to detect it from that.
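One way to act on that observation is to check the file signature before falling back to the heuristic. The snippet below is a minimal sketch, not part of binaryornot: `has_pdf_signature` and `read_head` are hypothetical helpers, and the only assumption is that a PDF begins with the bytes `%PDF-`, as the header quoted above does.

```python
def has_pdf_signature(first_bytes):
    """Return True if the data starts with the PDF header marker."""
    return first_bytes.startswith(b"%PDF-")


def read_head(path, size=1024):
    """Read the first `size` bytes of a file in binary mode."""
    with open(path, "rb") as handle:
        return handle.read(size)
```

A caller could treat a file as binary whenever `has_pdf_signature(read_head(path))` is true, and only otherwise ask is_binary.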
I see similar issues with this library: PDF files that obviously contain binary data are interpreted as not binary.
Any suggestions on how to resolve this one?
It looks like there is a possible false negative here. For the example PDF, the first chunk is:
b'%PDF-1.4\r%\xe2\xe3\xcf\xd3\r\n83 0 obj\n<</Metadata 97 0 R/Pages 5 0 R/Type/Catalog>>\nendobj\n97 0 obj\n<</Length 1534/Subtype/XML/Type/Metadata>>stream\r\n<?xpacket begin="\xef\xbb\xbf" id="W5M0MpCehiHzreSzNTczkc9d"?>\n<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 4.2.1-c043 52.389687, 2009/06/02-13:20:35 ">\n <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">\n <rdf:Description rdf:about=""\n xmlns:xmp="http://ns.adobe.com/xap/1.0/">\n <xmp:CreateDate>2016-04-19T10:00:37Z</xmp:CreateDate>\n <xmp:CreatorTool>Word</xmp:CreatorTool>\n <xmp:ModifyDate>2017-08-08T09:14:29-07:00</xmp:ModifyDate>\n <xmp:MetadataDate>2017-08-08T09:14:29-07:00</xmp:MetadataDate>\n </rdf:Description>\n <rdf:Description rdf:about=""\n xmlns:pdf="http://ns.adobe.com/pdf/1.3/">\n <pdf:Keywords/>\n <pdf:Producer>Mac OS X 10.9.4 Quartz PDFContext</pdf:Producer>\n </rdf:Description>\n <rdf:Description rdf:about=""\n xmlns:dc="http://purl.org'
That chunk gives a low ratio of low_chars:
low_chars = b'\xe2\xe3\xcf\xd3\xef\xbb\xbf'
nontext_ratio1 = 0.0068359375
It gives a high ratio of high_chars:
high_chars = 0.9931640625
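The two ratios can be reproduced with a short calculation. This is a sketch under stated assumptions rather than the library's code: it assumes low_chars is what remains after deleting printable ASCII and a few whitespace control characters, and high_chars is what remains after deleting bytes 127 to 255. `nontext_ratios` is a hypothetical name, and the sample chunk is synthetic (the header and the seven non-ASCII bytes from the example, padded with spaces to 1024 bytes).

```python
# Assumed byte classes (not confirmed in this thread).
CONTROL_CHARS = b"\n\r\t\f\b"
PRINTABLE_ASCII = CONTROL_CHARS + bytes(range(32, 127))
HIGH_BYTES = bytes(range(127, 256))


def nontext_ratios(chunk):
    """Return (low_chars ratio, high_chars ratio) for a non-empty chunk."""
    low_chars = chunk.translate(None, PRINTABLE_ASCII)   # bytes that are not printable ASCII
    high_chars = chunk.translate(None, HIGH_BYTES)       # bytes below 127
    return len(low_chars) / len(chunk), len(high_chars) / len(chunk)


# Synthetic 1024-byte chunk containing the same seven non-ASCII bytes.
header = b"%PDF-1.4\r%\xe2\xe3\xcf\xd3\r\n\xef\xbb\xbf"
chunk = header.ljust(1024, b" ")
ratio1, ratio2 = nontext_ratios(chunk)
print(ratio1, ratio2)
```

With seven non-ASCII bytes in 1024, the first ratio is 7/1024 and the second is 1017/1024, which matches the figures reported here.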
The detected encoding also has low confidence:
detected_encoding = {'encoding': 'ISO-8859-1', 'confidence': 0.506747572815534, 'language': ''}
The chunk also does not satisfy this check, since it contains neither a null byte nor 0xFF:
if b'\x00' in bytes_to_check or b'\xff' in bytes_to_check:
So is_binary returns False for this file.
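The decision path described above can be sketched roughly as follows. The thresholds (0.3, 0.05, 0.8 and 0.9) and the function name `classify` are assumptions made for illustration and may not match binaryornot's actual values. The sketch also treats a confident encoding guess as proof of text instead of trying to decode the bytes.

```python
def classify(ratio1, ratio2, confidence, chunk):
    """Return True for binary, False for text, using assumed thresholds."""
    likely_binary = (ratio1 > 0.3 and ratio2 < 0.05) or (ratio1 > 0.8 and ratio2 > 0.8)
    # Assumption: a confident encoding guess means the chunk is decodable text.
    decodable = confidence > 0.9
    if decodable:
        return False
    if likely_binary:
        return True
    # Last resort: a null or 0xFF byte marks the chunk as binary.
    return b"\x00" in chunk or b"\xff" in chunk


# Values reported in this thread; the chunk is abbreviated to its header
# and the seven non-ASCII bytes.
result = classify(0.0068359375, 0.9931640625, 0.506747572815534,
                  b"%PDF-1.4\r%\xe2\xe3\xcf\xd3\xef\xbb\xbf")
print(result)  # False
```

Under these assumptions none of the three branches fires for the example PDF, which is consistent with the false negative reported in this issue.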