Special characters are no longer recognized correctly
Bug
This bug did not occur with the following versions: docling 2.9.0, docling-core 2.10.0, docling-ibm-models 2.0.8, docling-parse 2.1.2.
I demonstrate the bug with two different parts of a PDF.
Here is the first example. The table header may not render properly below, but it is extracted correctly, so it can be ignored. The glyph placeholders, however, are a big problem.
| | | Shape | Appearance: layout type | Appearance: using ground truth | Classification Accuracy (%): family (S. 4.1) | Classification Accuracy (%): breed (S. 4.2), cat | Classification Accuracy (%): breed (S. 4.2), dog | Classification Accuracy (%): both (S. 4.3), hierarchical | Classification Accuracy (%): both (S. 4.3), flat |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | glyph[check] | - | - | 94.21 | NA | NA | NA | NA |
| 1 | 2 | - | Image | - | 82.56 | 52.01 | 40.59 | NA | 39.64 |
| 2 | 3 | - | Image + Head | - | 85.06 | 60.37 | 52.10 | NA | 51.23 |
| 3 | 4 | - | Image + Head + Body | - | 87.78 | 64.27 | 54.31 | NA | 54.05 |
| 4 | 5 | - | Image + Head + Body | glyph[check] | 88.68 | 66.12 | 57.29 | NA | 56.60 |
| 5 | 6 | glyph[check] | Image | - | 94.88 | 50.27 | 42.94 | 42.29 | 43.30 |
| 6 | 7 | glyph[check] | Image + Head | - | 95.07 | 59.11 | 54.56 | 52.78 | 54.03 |
| 7 | 8 | glyph[check] | Image + Head + Body | - | 94.89 | 63.48 | 55.68 | 55.26 | 56.68 |
| 8 | 9 | glyph[check] | Image + Head + Body | glyph[check] | 95.37 | 66.07 | 59.18 | 57.77 | 59.21 |
The original looks like:
Here you can see a table with special characters such as the check mark. These were recognized correctly in the version without GPU.
Here is the second example:
Nikolaos Livathinos, Cesar Berrospi, Maksym Lysak, Viktor Kuropiatnyk, Ahmed Nassar, Andre Carvalho, Michele Dolfi, Christoph Auer, Kasper Dinkla, Peter Staar
W * glyph[circledot] W *4 GLYPH<16> V GLYPH<133> 2 GLYPH<240> 4 GLYPH<239> V * ··· 5 glyph[floorleft] ⁄GLYPH<134> GLYPH<239> glyph[circledot] GLYPH<16> glyph[circledot] GLYPH<134> glyph[turnstileright] glyph[circledot] ⁄GLYPH<134> · GLYPH<16> V 4 GLYPH<239> 4 glyph[turnstileright] 4 -d 5GLYPH<134> V glyph[circledot] dd4GLYPH<23> glyph[circledot] GLYPH<134> glyph[turnstileright] glyph[circledot] GLYPH<226> ··· 52 21)
IBM Research Saumerstrasse 4 8803 Ruschlikon, Switzerland
The original looks like this:
The glyphs come from the topmost line.
There is also a second bug: the umlauts ä and ü are not recognized correctly either (see "Saumerstrasse" and "Ruschlikon" in the output above). This was already the case in the old versions.
Steps to reproduce
I have attached the PDF for the second example: article.pdf
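A minimal reproduction sketch follows, assuming the default `DocumentConverter` pipeline with Markdown export (the exact options used are not stated in this report). The helper names `convert_to_markdown` and `find_glyph_placeholders` are hypothetical, and the regular expression only covers the two placeholder forms seen in the output above, `glyph[name]` and `GLYPH<n>`.

```python
import re

# Matches the two placeholder forms seen in the output above:
# "glyph[check]" and "GLYPH<16>". Other forms may exist (assumption).
GLYPH_PATTERN = re.compile(r"glyph\[[^\]]+\]|GLYPH<\d+>")


def find_glyph_placeholders(text: str) -> list[str]:
    """Return every glyph placeholder found in the converted text, in order."""
    return GLYPH_PATTERN.findall(text)


def convert_to_markdown(pdf_path: str) -> str:
    """Convert a PDF with docling's default pipeline and export it to Markdown.

    docling is imported lazily so that the placeholder check above can be
    used without docling installed.
    """
    from docling.document_converter import DocumentConverter

    result = DocumentConverter().convert(pdf_path)
    return result.document.export_to_markdown()
```

Running `find_glyph_placeholders(convert_to_markdown("article.pdf"))` under both sets of versions should make the regression easy to compare: with the affected versions the list is expected to contain entries such as `glyph[circledot]` and `GLYPH<16>`.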
Docling version
docling 2.12.0 docling-core 2.10.0 docling-ibm-models 3.1.0 docling-parse 3.0.0
Python version
python 3.10