docling UnicodeEncodeError - Several Different PDFs

Bug

Docling parses some PDFs successfully but then fails to write the Markdown file with the results:

UnicodeEncodeError: 'charmap' codec can't encode character '\u2217' in position 51: character maps to <undefined>

I was able to resolve this for this specific PDF by changing line 1941 in docling_core\types\doc\document.py, but the tests failed after that change.
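For anyone trying to reproduce the failure without a PDF, here is a minimal sketch. It assumes the Markdown file is being opened with the Windows default encoding (cp1252, which Python reports as the 'charmap' codec). That is an inference from the traceback, not something confirmed against the docling source. `write_markdown` and `fails_with` are hypothetical helper names, not docling functions.

```python
import tempfile
from pathlib import Path

# U+2217 ASTERISK OPERATOR is the character named in the traceback.
text = "f \u2217 g"


def write_markdown(path: Path, markdown: str, encoding: str = "utf-8") -> None:
    """Write text with an explicit encoding instead of the platform default."""
    with open(path, "w", encoding=encoding) as fh:
        fh.write(markdown)


def fails_with(encoding: str) -> bool:
    """Return True if `text` cannot be written using the given encoding."""
    with tempfile.TemporaryDirectory() as tmp:
        try:
            write_markdown(Path(tmp) / "out.md", text, encoding=encoding)
        except UnicodeEncodeError:
            return True
    return False


# cp1252 has no mapping for U+2217, so the write raises UnicodeEncodeError;
# UTF-8 can represent every Unicode code point, so the same write succeeds.
print("cp1252 fails:", fails_with("cp1252"))
print("utf-8 fails: ", fails_with("utf-8"))
```

If the assumption holds, passing an explicit `encoding="utf-8"` wherever the output file is opened would be the kind of change that avoids the error.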

Steps to reproduce

Download this pdf: https://typeset.io/pdf/computational-challenges-in-bounded-model-checking-44b7toabj9.pdf
docling computational-challenges-in-bounded-model-checking-44b7toabj9.pdf
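A possible way to check whether the run above depends on the platform default encoding is Python's UTF-8 mode, enabled with the `PYTHONUTF8=1` environment variable (Python 3.7+). The sketch below only shows that the variable changes the default text encoding of a child interpreter. `child_default_encoding` is a hypothetical helper, and whether every code path in the docling CLI picks up this default is an assumption that has not been verified in this thread.

```python
import os
import subprocess
import sys


def child_default_encoding(utf8_mode: bool) -> str:
    """Start a child interpreter and report its default text encoding."""
    env = dict(os.environ)
    # PYTHONUTF8=1 turns UTF-8 mode on; PYTHONUTF8=0 turns it off explicitly.
    env["PYTHONUTF8"] = "1" if utf8_mode else "0"
    probe = "import locale; print(locale.getpreferredencoding(False))"
    result = subprocess.run(
        [sys.executable, "-c", probe],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip().lower()


print("UTF-8 mode on: ", child_default_encoding(True))
print("UTF-8 mode off:", child_default_encoding(False))
```

On Windows this would mean running `set PYTHONUTF8=1` in cmd (or `$env:PYTHONUTF8 = "1"` in PowerShell) before the docling command. Whether that avoids the error for these PDFs is untested here.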

I've encountered this on other PDFs as well: https://batch.libretexts.org/print/url=https://math.libretexts.org/Bookshelves/Combinatorics_and_Discrete_Mathematics/Elementary_Foundations%3A_An_Introduction_to_Topics_in_Discrete_Mathematics_(Sylvestre)/03%3A_Boolean_algebra/3.02%3A_Disjunctive_Normal_Form.pdf

Docling version

Docling version: 2.12.0
Docling Core version: 2.9.0
Docling IBM Models version: 3.1.0
Docling Parse version: 3.0.0

Python version

Python 3.11.9

Dec 14 '24 21:12 nickrallison

@nickrallison thanks for pointing this out! I'll test it more, and if it's not breaking anything (which I think it shouldn't) we'll regenerate tests.

Dec 19 '24 11:12 maxmnemonic

Just wanted to chime in that I love docling; I've been using it quite a bit for a personal project. I've encountered this error often when trying to parse long (300+ page) PDFs into Markdown. In a sample of about 50 lengthy PDFs, roughly half hit this issue.

Dec 27 '24 15:12 evan-rash

@maxmnemonic Something that may provide a simple solution could be a library like chardet.

I'm not familiar with the mechanism docling uses behind the scenes to generate the data that goes into the text file, but something like chardet should (I think) be able to detect the encoding regardless.
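As a diagnostic aid, and separate from the chardet idea: the traceback in this issue is raised while encoding text on write, so it can help to see exactly which characters a given output codec rejects. The sketch below uses only the standard library. `unencodable_chars` is a hypothetical helper, and cp1252 is an assumed stand-in for the Windows default encoding.

```python
def unencodable_chars(text: str, encoding: str) -> list[tuple[int, str]]:
    """Return (position, code point) for each character the codec cannot encode."""
    bad = []
    for index, char in enumerate(text):
        try:
            char.encode(encoding)
        except UnicodeEncodeError:
            bad.append((index, f"U+{ord(char):04X}"))
    return bad


# Math symbols of the kind that show up in LaTeX-typeset papers.
sample = "x \u2217 y \u2264 z"

print(unencodable_chars(sample, "cp1252"))
print(unencodable_chars(sample, "utf-8"))
```

Running this over the exported Markdown string before writing it would show whether the failing characters are limited to math symbols or are more widespread.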

I want to say thanks for all the hard work on the project. I would love to use it more, but I run into this issue on most documents I scan (mostly LaTeX research papers). I would be ecstatic if this family of issues could be resolved.

As a side note, do you have a set of PDFs and the rendered texts they correspond to for testing? My troubles make me wonder if there is some family of PDFs that has not been tested.

Jan 08 '25 19:01 nickrallison

@nickrallison I re-tested both PDFs you provided above with docling==2.17.0, and I get output in both cases. I will therefore close this issue as resolved. If you find more evidence that this problem still exists, please re-open! Thanks.

Jan 31 '25 09:01 cau-git