Special characters are no longer recognized correctly

Open JeandeBalzac opened this issue 1 year ago • 0 comments

Bug

This bug did not occur up to and including these versions: docling 2.9.0 docling-core 2.10.0 docling-ibm-models 2.0.8 docling-parse 2.1.2

I demonstrate the bug with two different parts of a PDF.

Here is the first example. The table header is not rendered correctly below, but it is correct in the actual output, so it can be ignored. The glyphs, however, are a big problem.

| | | Shape | Appearance | Appearance | Classification Accuracy (%) | Classification Accuracy (%) | Classification Accuracy (%) | Classification Accuracy (%) | Classification Accuracy (%) | | | | . | layout type. | using ground truth. | family.(S. 4.1) | breed (S. 4.2) | breed (S. 4.2) | both (S. 4.3) | both (S. 4.3) |

. cat dog hierarchical flat
0 1 glyph[check] - - 94.21 NA NA NA NA
1 2 - Image - 82.56 52.01 40.59 NA 39.64
2 3 - Image + Head - 85.06 60.37 52.10 NA 51.23
3 4 - Image + Head + Body - 87.78 64.27 54.31 NA 54.05
4 5 - Image + Head + Body glyph[check] 88.68 66.12 57.29 NA 56.60
5 6 glyph[check] Image - 94.88 50.27 42.94 42.29 43.30
6 7 glyph[check] Image + Head - 95.07 59.11 54.56 52.78 54.03
7 8 glyph[check] Image + Head + Body - 94.89 63.48 55.68 55.26 56.68
8 9 glyph[check] Image + Head + Body glyph[check] 95.37 66.07 59.18 57.77 59.21

The original looks like this: [Screenshot from 2024-12-15 09-15-35]

Here you can see a table with special characters such as the check mark. They were recognized correctly in the version without GPU.

Here is the second example:

Nikolaos Livathinos, Cesar Berrospi, Maksym Lysak, Viktor Kuropiatnyk, Ahmed Nassar, Andre Carvalho, Michele Dolfi, Christoph Auer, Kasper Dinkla, Peter Staar

W * glyph[circledot] W *4 GLYPH<16> V GLYPH<133> 2 GLYPH<240> 4 GLYPH<239> V * ··· 5 glyph[floorleft] ⁄GLYPH<134> GLYPH<239> glyph[circledot] GLYPH<16> glyph[circledot] GLYPH<134> glyph[turnstileright] glyph[circledot] ⁄GLYPH<134> · GLYPH<16> V 4 GLYPH<239> 4 glyph[turnstileright] 4 -d 5GLYPH<134> V glyph[circledot] dd4GLYPH<23> glyph[circledot] GLYPH<134> glyph[turnstileright] glyph[circledot] GLYPH<226> ··· 52 21) IBM Research Saumerstrasse 4 8803 Ruschlikon, Switzerland

The original looks like this: [image]

The glyphs come from the topmost line. There is also a second bug: the characters ä and ü are not recognized correctly either, but this was already the case in the old versions.

Steps to reproduce

The PDF for the second example is attached: article.pdf
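To make the problem easier to check across versions, here is a minimal sketch that converts a PDF and counts the placeholder tokens in the result. It assumes docling's quickstart API (`DocumentConverter().convert(...)` followed by `export_to_markdown()`). The helper names `find_glyph_placeholders` and `convert_to_markdown` are hypothetical and exist only for this illustration, and the regular expression covers only the two placeholder forms that appear in the output above.

```python
import re
import sys
from collections import Counter

# Matches the two placeholder forms seen in the output above,
# e.g. "glyph[check]" and "GLYPH<134>". Other forms may exist.
GLYPH_PATTERN = re.compile(r"glyph\[\w+\]|GLYPH<\d+>")


def find_glyph_placeholders(text: str) -> Counter:
    """Count every glyph placeholder token found in the converted text."""
    return Counter(match.group(0) for match in GLYPH_PATTERN.finditer(text))


def convert_to_markdown(pdf_path: str) -> str:
    """Convert a PDF with docling (assumed quickstart API) and return Markdown."""
    # Imported lazily so the placeholder check also works without docling installed.
    from docling.document_converter import DocumentConverter

    result = DocumentConverter().convert(pdf_path)
    return result.document.export_to_markdown()


if __name__ == "__main__" and len(sys.argv) > 1:
    markdown = convert_to_markdown(sys.argv[1])
    for placeholder, count in find_glyph_placeholders(markdown).most_common():
        print(f"{count:4d}  {placeholder}")
```

Running the script with `article.pdf` as its only argument under both sets of versions should show whether the placeholders appear only with the newer packages.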

Docling version

docling 2.12.0 docling-core 2.10.0 docling-ibm-models 3.1.0 docling-parse 3.0.0
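For comparing a working and a broken environment, the installed versions of the four packages can be read with the standard library. This is only a sketch: the function name `installed_versions` is hypothetical, and the package list is simply the four names from this report.

```python
from importlib.metadata import PackageNotFoundError, version

# The four distributions listed in this report.
PACKAGES = ["docling", "docling-core", "docling-ibm-models", "docling-parse"]


def installed_versions(packages=PACKAGES):
    """Return a dict mapping each package name to its version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(name, ver or "not installed")
```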

Python version

python 3.10

JeandeBalzac, Dec 15 '24 08:12