
UnicodeEncodeError - Several Different PDFs

Open nickrallison opened this issue 1 year ago • 1 comments

Bug

Docling parses some PDFs successfully but then fails to write the markdown file with the results: UnicodeEncodeError: 'charmap' codec can't encode character '\u2217' in position 51: character maps to <undefined>

I was able to resolve this for this specific PDF by changing line 1941 in this file under docling_core\types\doc\document.py, but the tests failed.
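For anyone hitting the same traceback, here is a minimal sketch of what seems to be going on. It assumes the write happens with the platform's default encoding (often cp1252 on Windows), which is a guess about the setup rather than something confirmed in this thread. `write_markdown` is a hypothetical helper for illustration, not a docling function.

```python
import tempfile
from pathlib import Path

# U+2217 ASTERISK OPERATOR is the character named in the traceback.
text = "a \u2217 b"

# Assumption: the default text encoding is a charmap codec such as cp1252.
# cp1252 has no mapping for U+2217, so encoding raises UnicodeEncodeError.
try:
    text.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False


def write_markdown(path: Path, markdown: str) -> None:
    """Hypothetical helper: write UTF-8 regardless of the locale default."""
    path.write_text(markdown, encoding="utf-8")


# Writing with an explicit UTF-8 encoding succeeds and round-trips the text.
out = Path(tempfile.mkdtemp()) / "result.md"
write_markdown(out, text)
round_trip = out.read_text(encoding="utf-8")
```

If the default encoding really is the cause, running Python in UTF-8 mode (environment variable `PYTHONUTF8=1`) may also work around it, though that has not been checked against the docling CLI here.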

Steps to reproduce

  1. Download this pdf: https://typeset.io/pdf/computational-challenges-in-bounded-model-checking-44b7toabj9.pdf
  2. docling computational-challenges-in-bounded-model-checking-44b7toabj9.pdf

I've encountered this on other PDFs as well: https://batch.libretexts.org/print/url=https://math.libretexts.org/Bookshelves/Combinatorics_and_Discrete_Mathematics/Elementary_Foundations%3A_An_Introduction_to_Topics_in_Discrete_Mathematics_(Sylvestre)/03%3A_Boolean_algebra/3.02%3A_Disjunctive_Normal_Form.pdf

Docling version

Docling version: 2.12.0
Docling Core version: 2.9.0
Docling IBM Models version: 3.1.0
Docling Parse version: 3.0.0

Python version

Python 3.11.9

nickrallison avatar Dec 14 '24 21:12 nickrallison

@nickrallison thanks for pointing this out! I'll test it more, and if it's not breaking anything (which I think it shouldn't) we'll regenerate tests.

maxmnemonic avatar Dec 19 '24 11:12 maxmnemonic

Just wanted to chime in that I love docling and have been using it quite a bit for a personal project. I've encountered this error often when trying to parse long (300+ page) PDFs into markdown. In a sample of about 50 lengthy PDFs, roughly 50% of them hit this issue.

evan-rash avatar Dec 27 '24 15:12 evan-rash

@maxmnemonic Something that may provide a simple solution could be a library like chardet.

I'm not versed in the mechanism docling uses behind the scenes to generate the data that goes into the text file, but something like chardet should (I think) be able to detect the encoding regardless.
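One caveat worth noting: chardet guesses the encoding of bytes that already exist, while the traceback in this issue is raised when text is encoded for writing. As a rough diagnostic sketch for that side of the problem, the hypothetical helper below (`unencodable_chars`, not part of docling or chardet) lists the characters a given target codec cannot represent.

```python
def unencodable_chars(text: str, encoding: str) -> list[str]:
    """Return the distinct characters in `text` that `encoding` cannot encode."""
    bad: list[str] = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            # Record each offending character once, in order of appearance.
            if ch not in bad:
                bad.append(ch)
    return bad


sample = "x \u2217 y \u2217 z"
missing_in_cp1252 = unencodable_chars(sample, "cp1252")
missing_in_utf8 = unencodable_chars(sample, "utf-8")
```

Running this over the exported markdown string would show whether the target codec, rather than the input PDF, is the limiting factor.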

I want to say thanks for all the hard work on the project. I would love to use it more, but I run into this issue on most documents I scan (mostly LaTeX research papers). I would be ecstatic if this family of issues could be resolved.

As a side note, do you have a set of PDFs and the rendered texts they correspond to for testing? My troubles make me wonder if there is some family of PDFs that has not been tested.

nickrallison avatar Jan 08 '25 19:01 nickrallison

@nickrallison I re-tested both PDFs you provided above with docling==2.17.0, and I get output in both cases. I will therefore close this issue as resolved. If you find more evidence that this problem still exists, please re-open! Thanks.

cau-git avatar Jan 31 '25 09:01 cau-git