[Question]: Support for Arabic PDFs

Open karthik-v-b opened this issue 2 months ago • 1 comments

Self Checks

[x] I have searched for existing issues, including closed ones.
[x] I confirm that I am using English to submit this report (Language Policy).
[x] Non-English title submissions will be closed directly (Language Policy).
[x] Please do not modify this template :) and fill in all the required fields.

Describe your problem

I’m working with Arabic PDFs, and when parsing them using Deepdoc Parser with the paper chunking method, the extracted text appears in LTR (Left to Right) instead of the correct RTL (Right to Left) order used in Arabic. This results in scrambled text and unreadable chunks.
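
To make the problem easier to handle in a pipeline, extracted chunks can be flagged as mostly Arabic before they are indexed. The following is a minimal sketch using only the Python standard library; the function names and the 0.5 threshold are hypothetical and not part of RAGFlow or DeepDoc. It measures the share of strongly right-to-left characters in a chunk. It does not detect or repair scrambled ordering by itself.

```python
import unicodedata


def rtl_ratio(text: str) -> float:
    """Share of strongly directional characters whose bidi class is RTL.

    Arabic letters have Unicode bidi class "AL", Hebrew letters "R",
    and Latin letters "L". Digits, spaces, and punctuation are ignored.
    """
    rtl = ltr = 0
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in ("R", "AL"):
            rtl += 1
        elif bidi == "L":
            ltr += 1
    strong = rtl + ltr
    return rtl / strong if strong else 0.0


def needs_rtl_handling(text: str, threshold: float = 0.5) -> bool:
    """Hypothetical routing check: True if the chunk is mostly RTL text."""
    return rtl_ratio(text) >= threshold
```

Chunks for which `needs_rtl_handling` returns `True` could be routed to an RTL-aware extraction path or held back for manual review.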

From previous closed issues, I learned that converting the PDF to a Word document (.docx) before parsing can help resolve the RTL/LTR issue. However, some of my PDFs contain embedded images and charts, and when converting them to Word, the resulting .docx file becomes completely empty.

For an automated data ingestion pipeline that loads documents into a RAGFlow dataset, are there any recommended tools or services that can convert PDF → DOCX while meeting the following requirements?

Convert PDF → Word reliably for Arabic (RTL) text

Preserve embedded images, charts, and other visual elements

Integrate smoothly into a Python workflow (local library or cloud API)
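
Whatever converter is used, an automated pipeline would probably need to catch the empty-output case described above before ingestion. The sketch below is one possible check, written with the standard library only and independent of any particular converter. It assumes the usual .docx layout, where the main body is stored at `word/document.xml` and embedded images under `word/media/`; the function names are hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by <w:t> text runs in document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def inspect_docx(path_or_file):
    """Return (text_char_count, image_count) for a .docx file.

    Accepts a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(path_or_file) as zf:
        names = zf.namelist()
        if "word/document.xml" not in names:
            return 0, 0
        root = ET.fromstring(zf.read("word/document.xml"))
        # Sum the length of every text run in the document body.
        chars = sum(len(t.text or "") for t in root.iter(f"{{{W_NS}}}t"))
        # Count files stored in the media folder (images, charts as images).
        images = sum(
            1 for n in names
            if n.startswith("word/media/") and not n.endswith("/")
        )
    return chars, images


def conversion_looks_empty(path_or_file) -> bool:
    """True if the converted file has neither text nor embedded media."""
    chars, images = inspect_docx(path_or_file)
    return chars == 0 and images == 0
```

A pipeline step could call `conversion_looks_empty` on each converted file and fall back to another converter, or to parsing the original PDF, when it returns `True`.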

If you have suggestions on libraries, external APIs, or best-practice approaches for handling Arabic PDFs with mixed content during conversion, it would be extremely helpful.

Nov 16 '25 09:11 karthik-v-b

You can raise a feature request for this.

Nov 27 '25 03:11 Magicbook1108