
Issue when locale is set to French

Open · yannistml opened this issue 8 months ago • 1 comment

Bug

When the locale is explicitly set to French, docling is not able to parse PDFs correctly. The problem seems to be linked specifically to locale.LC_NUMERIC, because docling works correctly with this configuration:

locale.setlocale(locale.LC_ALL, "fr_FR.UTF-8")
locale.setlocale(locale.LC_NUMERIC, "en_US.UTF-8")

whereas it fails when only LC_NUMERIC is explicitly set to French:

locale.setlocale(locale.LC_NUMERIC, "fr_FR.UTF-8")
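
As a possible workaround until the root cause is found, LC_NUMERIC could be overridden only for the duration of the conversion and restored afterwards. The following is a minimal sketch, not a confirmed fix: `numeric_locale` is a hypothetical helper name, and the default of "C" is an assumption (only "en_US.UTF-8" was observed to work above). It relies solely on the standard `locale` module.

```python
import locale
from contextlib import contextmanager


@contextmanager
def numeric_locale(name="C"):
    """Temporarily switch LC_NUMERIC to `name` and restore it on exit.

    Hypothetical helper; "C" is an assumed default, the report above
    only confirms that "en_US.UTF-8" works.
    """
    # Calling setlocale without a second argument only queries the current value.
    previous = locale.setlocale(locale.LC_NUMERIC)
    locale.setlocale(locale.LC_NUMERIC, name)
    try:
        yield
    finally:
        # Restore the caller's numeric locale even if the body raised.
        locale.setlocale(locale.LC_NUMERIC, previous)
```

The conversion call would then be wrapped as `with numeric_locale("en_US.UTF-8"):` around `convert.convert(pdf_path)`. Note that `locale.setlocale` changes process-wide state, so this is not safe if other threads depend on the numeric locale at the same time.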

Steps to reproduce

import locale

from docling.document_converter import DocumentConverter

locale.setlocale(locale.LC_ALL, "fr_FR.UTF-8")
convert = DocumentConverter()
pdf_path = "PathToYourPDF"
result = convert.convert(pdf_path)

print(result.document.export_to_markdown())

Docling version

Docling version: 2.30.0
Docling Core version: 2.28.0
Docling IBM Models version: 3.4.2
Docling Parse version: 4.0.1
Python: cpython-311 (3.11.9)
Platform: macOS-14.7.4-x86_64-i386-64bit

Python version

...

Python 3.11.9

yannistml, Apr 24 '25 15:04

@yannistml I can confirm that docling starts to hang on our standard test PDF, tests/data/pdf/2206.01062.pdf, and eventually produces garbage output. The problem appears to be rooted in the layout postprocessing. A deeper investigation will be necessary to find the root cause.

cau-git, May 21 '25 13:05