
Incorrect parsing of Persian PDF

Open sadeghtkd opened this issue 11 months ago • 1 comments

Bug

The attached PDF is in Persian. When it is parsed to Markdown, the letters come out in the wrong order. ...

Steps to reproduce

As a workaround, I passed the Markdown output through get_display from the bidi package (python-bidi). That fixed most of the problem, but not all of it.

from bidi.algorithm import get_display  # from the python-bidi package
from docling.document_converter import DocumentConverter

source = "omidname-shemsh.pdf"  # local path or URL of the document
converter = DocumentConverter()
result = converter.convert(source)
md_result = result.document.export_to_markdown()

# Reorder the exported text with the bidi algorithm
corrected_md_result = get_display(md_result)

print(corrected_md_result)

The md_result output:

<!-- image -->

ناريا یلااک سروب تکرش

## يما د همان

یلصا رازاب رد شريذپ -یلااک یلخاد

: لااک مان

The corrected_md_result output:

<!-- image -->

شرکت بورس کاالی ايران

نامه د امي ##

داخلی کاالی- پذيرش در بازار اصلی

نام کاال :
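
In the corrected output above, the heading marker ## ends up at the end of its line, because get_display was applied to the whole Markdown string, markers included. A minimal sketch of one way to avoid that follows: apply the reordering line by line and leave leading Markdown markers and HTML comments in place. The helper name fix_lines and the prefix pattern are hypothetical, not part of docling or python-bidi, and this only keeps the markers where they are. It is not a fix for the remaining character-level errors.

```python
import re
from typing import Callable

# Leading Markdown markers to keep in place: heading hashes, list bullets,
# numbered-list markers and blockquote marks, each followed by whitespace.
# (Hypothetical pattern; extend it for other Markdown constructs.)
_PREFIX = re.compile(r"^(\s*(?:#{1,6}|[-*+]|\d+[.)]|>)\s+)(.*)$")


def fix_lines(markdown: str, transform: Callable[[str], str]) -> str:
    """Apply `transform` to the text of each line, leaving Markdown markers alone."""
    fixed = []
    for line in markdown.split("\n"):
        stripped = line.strip()
        # Leave blank lines and HTML comments such as "<!-- image -->" untouched.
        if not stripped or stripped.startswith("<!--"):
            fixed.append(line)
            continue
        match = _PREFIX.match(line)
        if match:
            prefix, text = match.groups()
            fixed.append(prefix + transform(text))
        else:
            fixed.append(transform(line))
    return "\n".join(fixed)
```

With python-bidi installed, the call would presumably be fix_lines(md_result, get_display), with get_display imported from bidi.algorithm as in the snippet above.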

Docling version

docling-2.15.1

Python version

Python 3.12.3

omidname-shemsh.pdf

sadeghtkd, Jan 12 '25 09:01

This will be resolved once this issue is taken care of: https://github.com/DS4SD/docling-parse/issues/93

PeterStaar-IBM, Jan 31 '25 13:01

This was addressed in the meantime. Please re-open if you find this still broken.

cau-git, May 21 '25 14:05