docling
Incorrect parsing of Persian PDF
Bug
The attached PDF is in Persian, and when it is parsed to Markdown the letters come out in the wrong order. ...
Steps to reproduce
To work around this problem I used the bidi package. It fixed most of the problem, but not all of it.
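from bidi.algorithm import get_display  # from the python-bidi package
from docling.document_converter import DocumentConverter

source = "omidname-shemsh.pdf"  # local path or URL of the document
converter = DocumentConverter()
result = converter.convert(source)
# Export the parsed document to Markdown (the letters come out in the wrong order)
md_result = result.document.export_to_markdown()
# Reorder the whole Markdown string for display with the bidi algorithm
corrected_md_result = get_display(md_result)
print(corrected_md_result)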
Output of md_result:
<!-- image -->
ناريا یلااک سروب تکرش
## يما د همان
یلصا رازاب رد شريذپ -یلااک یلخاد
: لااک مان
Output of corrected_md_result:
<!-- image -->
شرکت بورس کاالی ايران
نامه د امي ##
داخلی کاالی- پذيرش در بازار اصلی
نام کاال :
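One visible side effect in the corrected output is that the heading marker `##` ends up at the end of its line, because the reordering was applied to the whole Markdown string, markers included. A minimal sketch of one possible workaround is to reorder only the text of each line and leave leading Markdown markers and HTML comments where they are. The helper name `reorder_markdown` is hypothetical and not part of docling or python-bidi; the `reorder` argument is assumed to be a function such as `get_display`. This only addresses where the markers land, not the remaining character-level errors.

```python
import re
from typing import Callable

# Leading Markdown markers that should stay at the start of the line:
# headings ("## "), bullets ("- ", "* ", "+ ") and numbered items ("1. ").
_PREFIX = re.compile(r"^(\s*(?:#{1,6}|[-*+]|\d+\.)\s+)?(.*)$")


def reorder_markdown(md: str, reorder: Callable[[str], str]) -> str:
    """Apply `reorder` to the text of each line, keeping Markdown markers in place.

    `reorder` is assumed to be a per-string function such as
    bidi.algorithm.get_display; this helper itself is hypothetical.
    """
    out = []
    for line in md.splitlines():
        # Leave blank lines and HTML comments such as "<!-- image -->" untouched.
        if not line.strip() or line.lstrip().startswith("<!--"):
            out.append(line)
            continue
        prefix, body = _PREFIX.match(line).groups()
        out.append((prefix or "") + reorder(body))
    return "\n".join(out)
```

With python-bidi installed, this would be called as `reorder_markdown(md_result, get_display)` in place of `get_display(md_result)`.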
Docling version
docling-2.15.1
Python version
Python 3.12.3
This will be resolved once this issue is taken care of: https://github.com/DS4SD/docling-parse/issues/93
This was addressed in the meantime. Please re-open if you find this still broken.