pdfplumber
pdfplumber copied to clipboard
Explicitly typecast `fontname` and `text` fields to str for char objects
This PR fixes #461. The issue was happening because, in the PDF, there was a char object as
{
"fontname": b"\xcb\xce\xcc\xe5",
"adv": Decimal("5.000"),
"upright": True,
"x0": Decimal("-146.900"),
"y0": Decimal("782.030"),
"x1": Decimal("-141.900"),
"y1": Decimal("792.030"),
"width": Decimal("5.000"),
"height": Decimal("10.000"),
"size": Decimal("10.000"),
"object_type": "char",
"page_number": 1,
"stroking_color": [1, 1, 1],
"non_stroking_color": [0, 0, 0],
"text": "B",
"top": Decimal("49.970"),
"bottom": Decimal("59.970"),
"doctop": Decimal("49.970"),
}
and since the fontname was of type bytes and not str, the sorted method threw an error as not all the fontnames were having the same type.
As a fix, I made it such that when fetching the chars, we do an explicit type cast to a string of fields text and fontname.
@jsvine My concern with this is that pdfplumber will be altering the fontname as provided in the PDF in which case, a better solution might be to only to do the typecast when calling the sorted(...) method in yield_unique_chars(...). What do you think?
Codecov Report
Merging #462 (5311913) into develop (002500a) will increase coverage by
0.00%. The diff coverage is100.00%.
:exclamation: Current head 5311913 differs from pull request most recent head ca8bfbb. Consider uploading reports for the commit ca8bfbb to get more accurate results
@@ Coverage Diff @@
## develop #462 +/- ##
========================================
Coverage 98.28% 98.29%
========================================
Files 10 10
Lines 1227 1232 +5
========================================
+ Hits 1206 1211 +5
Misses 21 21
| Impacted Files | Coverage Δ | |
|---|---|---|
| pdfplumber/container.py | 100.00% <100.00%> (ø) |
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact),ø = not affected,? = missing dataPowered by Codecov. Last update 002500a...ca8bfbb. Read the comment docs.
Thanks, @samkit-jain! A very interesting issue here. I've spent some time looking at it, and what follows is my understanding.
As you note, it's just one character that's causing problems, and the problem is due to the fontname property being represented as bytes rather than a string. Here's why I think that's happening:
- In the raw PDF, the
FontDescriptoris defined as such:
<</Type/FontDescriptor/FontName/BABGUC+#cb#ce#cc#e5/FontBBox[-7 -140 1000 859]/Flags 65568
/Ascent 859
/CapHeight 859
/Descent -140
/ItalicAngle 0
/StemV 150
/CIDSet 65 0 R
/FontFile2 53 0 R>>
-
... so the
/FontNameentry in that descriptor dictionary is the PostScript literal/BABGUC+#cb#ce#cc#e5 -
When
pdfminer.six(andpdfminerbefore it) encounters a#inside a PostScript literal, it interprets the two subsequent characters as a hexadecimal-encoded byte. From what I understand, it also tries interpreting all PostScript literals asutf-8strings, but allows un-decodable strings to pass through asbytesobjects. So I believe that's why/BABGUC+#cb#ce#cc#e5is coming through asb"\xcb\xce\xcc\xe5". (Maybe there's a logic flaw that leads to theBABGUC+part being missing?)
From what I can tell online, however, PostScript FontNames are supposed to be strings and not contain #-hex-hex encodings, though I'm not 100% sure.
Since this is just one character in one PDF, and we haven't encountered this issue before (despite many people using pdfplumber to parse documents with non-Latin alphabets), I'd lean toward not adding code to handle it. Instead, the original issue might be addressed by adding these two lines inside the for page loop:
for char in page.objects["char"]:
char["fontname"] = str(char["fontname"])
What do you think?
Hi @jsvine This is very insightful. Yes, it could be that the font name was in a non-UTF-8 format, say UTF-16. I also tried repairing the PDF using Ghostscript but it didn't have any effect. I agree that it is a rare case but I am not sure if not adding a fix in the library is the right way to go. But, yes, can wait and see if more people report a similar issue.
Sounds good — let's wait to see whether this may be a more widespread issue with some PDFs.
Thanks, @samkit-jain! A very interesting issue here. I've spent some time looking at it, and what follows is my understanding.
As you note, it's just one character that's causing problems, and the problem is due to the
fontnameproperty being represented as bytes rather than a string. Here's why I think that's happening:
- In the raw PDF, the
FontDescriptoris defined as such:<</Type/FontDescriptor/FontName/BABGUC+#cb#ce#cc#e5/FontBBox[-7 -140 1000 859]/Flags 65568 /Ascent 859 /CapHeight 859 /Descent -140 /ItalicAngle 0 /StemV 150 /CIDSet 65 0 R /FontFile2 53 0 R>>
- ... so the
/FontNameentry in that descriptor dictionary is the PostScript literal/BABGUC+#cb#ce#cc#e5- When
pdfminer.six(andpdfminerbefore it) encounters a#inside a PostScript literal, it interprets the two subsequent characters as a hexadecimal-encoded byte. From what I understand, it also tries interpreting all PostScript literals asutf-8strings, but allows un-decodable strings to pass through asbytesobjects. So I believe that's why/BABGUC+#cb#ce#cc#e5is coming through asb"\xcb\xce\xcc\xe5". (Maybe there's a logic flaw that leads to theBABGUC+part being missing?)From what I can tell online, however, PostScript
FontNames are supposed to be strings and not contain#-hex-hexencodings, though I'm not 100% sure.Since this is just one character in one PDF, and we haven't encountered this issue before (despite many people using
pdfplumberto parse documents with non-Latin alphabets), I'd lean toward not adding code to handle it. Instead, the original issue might be addressed by adding these two lines inside thefor pageloop:for char in page.objects["char"]: char["fontname"] = str(char["fontname"])What do you think?
still doesn't work all the time
Ended up implementing a similar, but slightly different version of this, based on some research about the common bytes-typed fontnames we saw cropping up in issues/discussions: https://github.com/jsvine/pdfplumber/pull/862/commits/9441ff7628fff9f69d81c6afd8ef439bf101b254
Thank you for getting the ball rolling on this, @samkit-jain!