docling: Incorrect parsing of Persian PDF

Bug

The attached PDF is in Persian, and when it is converted to Markdown the letters come out in the wrong order. ...

Steps to reproduce

As a workaround I passed the exported Markdown through get_display from the bidi package. That fixed most of the problem, but not all of it.

from bidi.algorithm import get_display  # provided by the python-bidi package
from docling.document_converter import DocumentConverter

source = "omidname-shemsh.pdf"  # local path or URL of the document
converter = DocumentConverter()
result = converter.convert(source)
md_result = result.document.export_to_markdown()

# Workaround: reorder the exported text for right-to-left display
corrected_md_result = get_display(md_result)

print(corrected_md_result)

The md_result output:

<!-- image -->

ناريا یلااک سروب تکرش

## يما د همان

یلصا رازاب رد شريذپ -یلااک یلخاد

: لااک مان

The corrected_md_result output:

<!-- image -->

شرکت بورس کاالی ايران

نامه د امي ##

داخلی کاالی- پذيرش در بازار اصلی

نام کاال :
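
One leftover visible in the output above is that the heading marker ## ends up at the end of its line, because get_display was applied to the whole Markdown string, markers included. The following is only a sketch of a possible refinement, not something docling provides: it assumes a hypothetical helper, reorder_markdown_lines, that keeps a leading Markdown marker and HTML comment lines in place and passes only the remaining text to a reordering function. It does not address the remaining issues with letter order inside words.

```python
import re
from typing import Callable

# Optional leading Markdown marker (heading, list bullet, block quote),
# followed by the rest of the line.
_PREFIX = re.compile(r"^(\s*(?:#{1,6}|[-*+]|>)\s+)?(.*)$")


def reorder_markdown_lines(md: str, reorder: Callable[[str], str]) -> str:
    """Apply `reorder` to each line's text, leaving Markdown markers in place.

    Hypothetical helper for illustration; `reorder` is any str -> str
    function, for example get_display from python-bidi.
    """
    out = []
    for line in md.split("\n"):
        # Leave HTML comments such as "<!-- image -->" untouched.
        if line.strip().startswith("<!--"):
            out.append(line)
            continue
        match = _PREFIX.match(line)
        prefix = match.group(1) or ""
        text = match.group(2)
        out.append(prefix + (reorder(text) if text else text))
    return "\n".join(out)


# Usage with python-bidi (assumed to be installed):
#   from bidi.algorithm import get_display
#   corrected_md_result = reorder_markdown_lines(md_result, get_display)
```

Processing line by line also keeps blank lines and paragraph breaks exactly where the exporter put them.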

Docling version

docling-2.15.1

Python version

Python 3.12.3

omidname-shemsh.pdf

Jan 12 '25 09:01 sadeghtkd

This will be resolved once this issue is taken care of: https://github.com/DS4SD/docling-parse/issues/93

Jan 31 '25 13:01 PeterStaar-IBM

This was addressed in the meantime. Please re-open if you find this still broken.

May 21 '25 14:05 cau-git