Issue when locale is set to French
Bug
When the locale is explicitly set to French, docling fails to parse PDFs correctly. The problem seems to be tied specifically to locale.LC_NUMERIC, because docling works correctly with this configuration:
locale.setlocale(locale.LC_ALL, "fr_FR.UTF-8")
locale.setlocale(locale.LC_NUMERIC, "en_US.UTF-8")
whereas it fails when only LC_NUMERIC is set explicitly:
locale.setlocale(locale.LC_NUMERIC, "fr_FR.UTF-8")
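Since the working configuration above differs only in LC_NUMERIC, one possible stopgap is to pin LC_NUMERIC to a dot-decimal locale only for the duration of the conversion. The sketch below is a rough illustration, not a confirmed fix: `numeric_locale` is a hypothetical helper name, it uses the always-available "C" locale instead of "en_US.UTF-8", and it has not been tested against docling. Note that `locale.setlocale` changes process-wide state, so this is not safe to use while other threads depend on the locale.

```python
import locale
from contextlib import contextmanager


@contextmanager
def numeric_locale(name="C"):
    """Temporarily switch LC_NUMERIC and restore the previous value on exit.

    Hypothetical helper: not part of docling or the standard library.
    """
    # Calling setlocale with only a category returns the current setting.
    previous = locale.setlocale(locale.LC_NUMERIC)
    locale.setlocale(locale.LC_NUMERIC, name)
    try:
        yield
    finally:
        # Restore whatever LC_NUMERIC was before entering the block.
        locale.setlocale(locale.LC_NUMERIC, previous)


# Intended usage (assumes docling is installed; left commented out here):
# from docling.document_converter import DocumentConverter
# with numeric_locale("C"):
#     result = DocumentConverter().convert("PathToYourPDF")
```

Restoring the saved value in a `finally` clause keeps the rest of the application on its French locale even if the conversion raises.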
Steps to reproduce
import locale
from docling.document_converter import DocumentConverter
locale.setlocale(locale.LC_ALL, "fr_FR.UTF-8")
convert = DocumentConverter()
pdf_path = "PathToYourPDF"
result = convert.convert(pdf_path)
print(result.document.export_to_markdown())
Docling version
Docling version: 2.30.0
Docling Core version: 2.28.0
Docling IBM Models version: 3.4.2
Docling Parse version: 4.0.1
Python: cpython-311 (3.11.9)
Platform: macOS-14.7.4-x86_64-i386-64bit
Python version
...
Python 3.11.9
@yannistml I can confirm that docling starts to hang on our standard test PDF, tests/data/pdf/2206.01062.pdf, and eventually produces garbage output. The problem appears to be rooted in the layout postprocessing. Deeper investigation will be needed to find the root cause.