UnicodeEncodeError - Several Different PDFs
Bug
Docling parses some PDFs successfully but then fails to write the markdown file with the results.
UnicodeEncodeError: 'charmap' codec can't encode character '\u2217' in position 51: character maps to <undefined>
I was able to resolve this for this specific PDF by changing line 1941 of docling_core\types\doc\document.py, but the tests failed.
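For context, here is a minimal sketch of how this kind of error can arise and one way to avoid it. It assumes the failing write goes through Python's `open()` without an explicit `encoding`, so the platform default applies (often cp1252 on Windows, a charmap-based codec). The issue does not show what line 1941 contains, so this is an assumption, and `write_markdown` and `roundtrip` are hypothetical helper names, not docling functions.

```python
import tempfile
from pathlib import Path

# U+2217 ASTERISK OPERATOR is the character named in the traceback.
SAMPLE = "x \u2217 y"


def can_encode(text: str, codec: str) -> bool:
    """Return True if every character in text is representable in codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def write_markdown(path: Path, content: str) -> None:
    """Hypothetical helper: write text with an explicit UTF-8 encoding
    instead of relying on the platform default."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(content)


def roundtrip(content: str) -> str:
    """Write content to a temporary file as UTF-8 and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "out.md"
        write_markdown(target, content)
        return target.read_text(encoding="utf-8")
```

`can_encode(SAMPLE, "cp1252")` is false while `can_encode(SAMPLE, "utf-8")` is true, which matches the symptom: parsing succeeds and only the final write fails when the default codec cannot represent a math symbol.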
Steps to reproduce
- Download this PDF: https://typeset.io/pdf/computational-challenges-in-bounded-model-checking-44b7toabj9.pdf
- Run: docling computational-challenges-in-bounded-model-checking-44b7toabj9.pdf
I've encountered this on other PDFs as well: https://batch.libretexts.org/print/url=https://math.libretexts.org/Bookshelves/Combinatorics_and_Discrete_Mathematics/Elementary_Foundations%3A_An_Introduction_to_Topics_in_Discrete_Mathematics_(Sylvestre)/03%3A_Boolean_algebra/3.02%3A_Disjunctive_Normal_Form.pdf
Docling version
- Docling version: 2.12.0
- Docling Core version: 2.9.0
- Docling IBM Models version: 3.1.0
- Docling Parse version: 3.0.0
Python version
Python 3.11.9
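Since the error depends on the encoding Python picks by default, it may help to report that alongside the versions. The snippet below is a small diagnostic sketch using only the standard library; `encoding_report` is a hypothetical name chosen for illustration.

```python
import locale
import sys


def encoding_report() -> dict:
    """Collect the settings that decide how text is encoded by default."""
    return {
        # Encoding used by open() when no encoding argument is passed.
        "preferred_encoding": locale.getpreferredencoding(False),
        # True when Python runs in UTF-8 mode (for example with PYTHONUTF8=1).
        "utf8_mode": bool(sys.flags.utf8_mode),
        # Encoding of the console stream, which may differ from the above.
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
    }


if __name__ == "__main__":
    for key, value in encoding_report().items():
        print(f"{key}: {value}")
```

If `preferred_encoding` is something like cp1252, setting the environment variable `PYTHONUTF8=1` before running the command is one possible workaround to try, since it makes UTF-8 the default text encoding.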
@nickrallison thanks for pointing this out! I'll test it more, and if it's not breaking anything (which I think it shouldn't) we'll regenerate tests.
Just wanted to chime in that I love docling and have been using it a lot for a personal project. I've hit this error frequently when parsing long (300+ page) PDFs into markdown. In a sample of about 50 lengthy PDFs, roughly half of them ran into this issue.
@maxmnemonic Something that may provide a simple solution could be a library like chardet.
I'm not familiar with the mechanism docling uses behind the scenes to generate the data that goes into the text file, but something like chardet should (I think) be able to detect the encoding regardless.
I want to say thanks for all the hard work on the project. I would love to use it more, but I run into this issue on most documents I scan (mostly LaTeX research papers). I would be ecstatic if this family of issues could be resolved.
As a side note, do you have a set of PDFs and the rendered texts they correspond to for testing? My troubles make me wonder if there is some family of PDFs that has not been tested.
@nickrallison I re-tested both PDFs you provided above with docling==2.17.0, and I get output in both cases. I will therefore close this issue as resolved. If you find more evidence that this problem still exists, please re-open! Thanks.