pdfmajor
PDFFont exception
The enclosed PDF CompilationMapLegend.pdf threw an exception in PDFFont line 140 with cid=31. Once the exception happens I can no longer process any information on that page. It would be nice to find a way to throw an exception to notify there was a bad character, but then ignore that character and continue processing. I worked around the problem by replacing the re-raise with returning self.cid2unicode[32] (space).
This document had multiple occurrences of bad characters, all cid=31. They all appeared to happen at the start of a string. The first occurrence was on PDF page 3 (page_num=2). I do not understand CMap or CID. ASCII 31 is "unit separator". In ASCII this appears to be a "divider" for plain text and probably could be ignored in my context. My assumption is these bad characters are really part of the text, but another possibility is there is a parsing problem.
This issue requires a bit of thought. I am thinking that a good fix for this would be to include an option during parsing that would allow "bad characters".
Perhaps something like this
PDFInterpreter("<filename>", ignore_bad_chars=True)
With this option, whenever a PDFUnicodeNotDefined would be thrown, the parser would instead return an empty string '' and keep the parsing process going.
Let me know what you think.
As for the mapping issue, not sure what to say, will need to explore it more deeply.
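A minimal sketch of the ignore_bad_chars idea above, not pdfmajor's actual code. The function decode_cids, the exception class UnicodeNotDefined (a stand-in for PDFUnicodeNotDefined), and the toy mapping are all assumptions made for illustration; only the flag name comes from the proposal.

```python
from typing import Dict, Iterable


class UnicodeNotDefined(Exception):
    """Hypothetical stand-in for pdfmajor's PDFUnicodeNotDefined."""


def decode_cids(
    cids: Iterable[int],
    cid2unicode: Dict[int, str],
    ignore_bad_chars: bool = False,
) -> str:
    """Map CIDs to text; an unmapped CID raises unless ignore_bad_chars is set."""
    out = []
    for cid in cids:
        try:
            out.append(cid2unicode[cid])
        except KeyError:
            if not ignore_bad_chars:
                raise UnicodeNotDefined(f"no unicode mapping for cid={cid}") from None
            out.append("")  # drop the bad character and keep going
    return "".join(out)


# Toy mapping: cid 31 is deliberately missing, as in the report above.
toy_map = {32: " ", 72: "H", 105: "i"}
text = decode_cids([31, 72, 105], toy_map, ignore_bad_chars=True)
```

With the flag off, the same call raises on the first unmapped cid, which matches the current behaviour described in the report.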
Thanks for thinking about this. I agree it requires some thought. I think your proposed solution is good. It preserves the simple interface of pdfmajor.
When I understood my pdfmajor exceptions were due to bad characters my first thought was “where are the bad characters” and “are they important”. I first substituted a
I thought about other methods that would allow more troubleshooting of a PDF with unsupported characters, but I think these would be too complicated. One example is an interpreter parameter that lets the user assign a bad character handler: “def bad_character_handler(cid) -> unicode”. This would give the user the ability to see the bad cid and assign any character, or ‘’, as the substitute. Even without knowing what a cid is, the user would be able to count how many there are. If only 1 character is bad, no big deal. If 50% are bad, big deal.
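To make the handler idea concrete, here is a rough sketch. Only the name bad_character_handler and its cid-in, character-out shape come from the comment above; decode_cids_with_handler, counting_handler, and the toy mapping are hypothetical and are not part of pdfmajor.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Optional

# A handler receives the unmapped cid and returns the substitute text.
BadCharHandler = Callable[[int], str]


def decode_cids_with_handler(
    cids: Iterable[int],
    cid2unicode: Dict[int, str],
    bad_character_handler: Optional[BadCharHandler] = None,
) -> str:
    """Decode CIDs, delegating unmapped ones to the handler if one is given."""
    out = []
    for cid in cids:
        if cid in cid2unicode:
            out.append(cid2unicode[cid])
        elif bad_character_handler is None:
            # No handler: fail as today.
            raise KeyError(f"no unicode mapping for cid={cid}")
        else:
            out.append(bad_character_handler(cid))
    return "".join(out)


bad_cid_counts: Counter = Counter()


def counting_handler(cid: int) -> str:
    """Count each bad cid and substitute a visible placeholder."""
    bad_cid_counts[cid] += 1
    return "?"


toy_map = {72: "H", 105: "i"}  # cid 31 is deliberately missing
text = decode_cids_with_handler([31, 72, 105, 31], toy_map, counting_handler)
```

After the call, bad_cid_counts shows how many times each bad cid occurred, which is enough to judge whether one character or half the document is affected.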
Another option that is more like your proposal is:
PDFInterpreter("
None : throws exception on bad character as today
‘’ : empty string ignores bad character
‘
This would allow me to do what I did to troubleshoot without having to modify the pdfmajor source. Would other people use this extra functionality? Probably only a few.
I do not understand CMaps or CID well. Are there some CID with values 0..31 that are safe to always ignore? That would fix my problem and maybe help others avoid any exception. It would not fix problems with other bad characters stopping processing on a PDF page.
Do not be too influenced by me. I am an amateur hack. I really like pdfmajor and looking through the source it looks well designed. I trust your judgement more than my own. Do what you think is best. I liked the package and just wanted to provide some feedback for you to consider. I am available if you want to use me as a sounding board for other concepts.
BTW. PDFMajor was the only PDF package I found that could handle color and it was the simplest to use. Once I worked around the bad characters it did exactly what I needed (get the location of text and rects, get the fill color of rects). Unfortunately the very first PDF I picked as a test case had problems.
-brett
(Sent in reply to Ari Sosnovsky's comment of Dec 7, 2019, 10:34 PM.)
Thanks for the kind words.
I like the idea of adding a counter to the number of wrong characters. I could implement an error-state and keep track of the number of bad-characters (as well as other errors that could propagate). Then I could raise a warning if anything has been logged to the error-state (plus make it accessible via PDFInterpreter.errors, where I could store some contextual information about the bad-characters, like where they were found).
So it would likely be a class like this
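from typing import List, NamedTuple

# BadChar is defined first so ErrorState can reference it; PDFFont is pdfmajor's font class.
class BadChar(NamedTuple):
    cid: int
    font: PDFFont
    lead_chars: List[str]
    follow_chars: List[str]
    page_num: int

class ErrorState:
    bad_chars: List[BadChar]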
What do you think? In the meantime I have implemented the ignore_bad_chars flag, so that you can run this cleanly, but it only replaces the bad-characters with "".
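A sketch of how the error-state described above could be filled in while decoding. It is illustrative only: decode_page, warn_if_any, and the context parameter are hypothetical names, and the font field is left out because PDFFont is not available outside pdfmajor.

```python
import warnings
from typing import Dict, List, NamedTuple


class BadChar(NamedTuple):
    cid: int
    page_num: int
    lead_chars: List[str]    # decoded characters just before the bad one
    follow_chars: List[str]  # decoded characters just after the bad one


class ErrorState:
    def __init__(self) -> None:
        self.bad_chars: List[BadChar] = []

    def warn_if_any(self) -> None:
        """Emit a single warning if any bad characters were recorded."""
        if self.bad_chars:
            warnings.warn(f"{len(self.bad_chars)} bad character(s) were ignored")


def decode_page(
    cids: List[int],
    cid2unicode: Dict[int, str],
    page_num: int,
    errors: ErrorState,
    context: int = 2,
) -> str:
    """Decode one page's CIDs, logging unmapped ones instead of raising."""
    chars = [cid2unicode.get(cid) for cid in cids]  # None marks a bad cid
    for i, (cid, ch) in enumerate(zip(cids, chars)):
        if ch is None:
            lead = [c for c in chars[max(0, i - context):i] if c is not None]
            follow = [c for c in chars[i + 1:i + 1 + context] if c is not None]
            errors.bad_chars.append(BadChar(cid, page_num, lead, follow))
    return "".join(c for c in chars if c is not None)


errors = ErrorState()
text = decode_page([31, 72, 105], {72: "H", 105: "i"}, page_num=2, errors=errors)
```

Calling errors.warn_if_any() once parsing finishes would then emit one warning, and errors.bad_chars stays available for anyone who wants the details.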
Some context: this library is actually just a refactoring project of the pdfminer.six project. Specifically, I needed access to color and reliable rect positioning information (along the way I also took the liberty of restructuring the repo, and added a bit of a speed-boost and more reliable html conversion). But looking back at this repo (I have not touched it in a few months now), I see some things that could be improved.
Given that, my understanding of CMap might not be any better than yours. The most I know is that it is a mapping of characters within the document. So if a cid is not found, that would mean one of the following:
- I did not port the implementation well enough
- the original writers in pdfminer made a mistake
- the tool you used to make the pdf errored while writing the table and some of the rows are missing (or perhaps the character itself was written wrong)
- the pdf document was corrupted at one point
I think your proposed solution is good. I think it provides everything to troubleshoot but is transparent if you do not want to troubleshoot.
I am not sure I understand the type “List[str]” for lead_chars and follow_chars. Would the str be limited to the text within an LTCharBlock? My initial thought was this could be just str but I am not sure what text you intend to include. Is the reason for List that you would break up the text within an LTCharBlock into chunks delimited by bad chars?
I am not sure how difficult this would be to implement but consider another option:
class BadChar(NamedTuple):
    cid: int
    page_num: int
    char_block: LTCharBlock  # or maybe char_block_text: str
    position: int  # index into the char block's text
If it is problematic or inefficient for the parser to retain every instance of LTCharBlock with errors, you could use the alternative of just str. Feel free to modify my variable names.
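As a rough illustration of the str-based alternative, the position index is enough to show where a bad character sat in its block. This sketch assumes the bad character was replaced with '' so that position is the index where it would have appeared; the describe helper and the sample text are made up for the example.

```python
from typing import NamedTuple


class BadChar(NamedTuple):
    cid: int
    page_num: int
    char_block_text: str  # decoded text of the block, bad character omitted
    position: int         # index in char_block_text where the bad character was


def describe(bad: BadChar, width: int = 10) -> str:
    """Show the text on either side of the bad character's position."""
    start = max(0, bad.position - width)
    before = bad.char_block_text[start:bad.position]
    after = bad.char_block_text[bad.position:bad.position + width]
    return f"page_num={bad.page_num} cid={bad.cid}: {before!r} <bad> {after!r}"


# Made-up block text with the bad character at the start of the string.
sample = BadChar(cid=31, page_num=2, char_block_text="Legend", position=0)
summary = describe(sample)
```

Keeping only the text and an index avoids holding on to every LTCharBlock, at the cost of losing the block's position on the page.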
A few comments about your thinking if cid is not found.
I ran the PDF I included through “pdfminer 20191125” and I did not encounter any exceptions or obvious errors. I do not know how this differs from pdfminer.six. At the time I didn’t realize there were questionable characters in the PDF, and I never went back to compare the text to see if it was simply ignoring those characters. After I concluded it did not support colors I never spent much time with it.
If you look at the PDF metadata, it appears to be produced by a popular PDF product. This could certainly have bugs, but if so, many PDFs might have this problem.
PDF Producer: Acrobat Distiller 7.0 (Windows)
Content Creator: PScript5.dll Version 5.2.2
The metadata also has a Title “ClassSymbolReport”. This hints at a script. The data may have been coming from a GIS program or database program (Access, MySQL, etc.) and the bad chars might be some byproduct of the script or data source.
Other applications like macOS Preview or Adobe Acrobat Reader open the document without issues. I do not suspect the document has been corrupted after its creation.
For reference, the PDF was part of the ZIP archive linked on http://repository.azgs.az.gov/uri_gin/azgs/dlio/1615. It was created by someone at the Arizona Geological Survey in 2015.
Hope this helps.
-brett
(Sent in reply to Ari Sosnovsky's comment of Dec 12, 2019, 8:40 PM.)