ragflow [Bug]: Document parsing error with traditional Chinese characters.

Self Checks

[x] I have searched for existing issues, including closed ones.
[x] I confirm that I am using English to submit this report (Language Policy).
[x] Non-English title submissions will be closed directly (Language Policy).
[x] Please do not modify this template :) and fill in all the required fields.

RAGFlow workspace code commit ID

26d3480e

RAGFlow image version

v0.19.0

Other environment information

Actual behavior

Operation steps: Log in to ragflow → Knowledge Base → Upload Document → Parse Document → Prompt "Failed" in parsing.

Expected behavior

Operation steps: Log in to ragflow → Knowledge Base → Upload Document → Parse Document → Prompt "Failed" in parsing.

Steps to reproduce

Operation steps:
Log in to ragflow → Knowledge Base → Upload Document → Parse Document → Prompt "Failed" in parsing.

Additional information

No response

Jun 10 '25 11:06 Maxliang2025

Hi @Maxliang2025! I'm Dosu and I’m helping the ragflow team.

RAGFlow relies on the HanziConv library to convert traditional Chinese characters to simplified Chinese during document parsing, especially in the PDF and DOCX parsers. This conversion happens automatically and unconditionally, with no error handling or fallback if HanziConv encounters ambiguous or unsupported characters. There are no documented limitations or workarounds in the codebase for HanziConv, so any issues in the library (like incomplete mappings or context-insensitive conversion) will directly affect parsing results and may cause failures or missing content in downstream processing. If you consistently see parsing failures with documents containing traditional Chinese, it may be due to edge cases or limitations in HanziConv's conversion logic, or possibly encoding issues in the input file itself.
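
As a rough illustration of the kind of guard described above, here is a minimal sketch of a conversion wrapper that falls back to the original text when the converter raises or returns nothing. The names `safe_convert` and `convert` are hypothetical and are not part of RAGFlow; the converter is passed in as a plain callable, so nothing here assumes how RAGFlow actually calls HanziConv.

```python
import logging
from typing import Callable

logger = logging.getLogger(__name__)


def safe_convert(text: str, convert: Callable[[str], str]) -> str:
    """Apply `convert` to `text`, keeping the original text on failure.

    `convert` is any str -> str function (hypothetical hook, e.g. a
    traditional-to-simplified converter supplied by the caller).
    """
    try:
        converted = convert(text)
    except Exception:
        # Do not let a conversion problem abort the whole parsing task.
        logger.warning("conversion failed; keeping original text", exc_info=True)
        return text
    # Treat a non-string or unexpectedly empty result as a failed conversion.
    if not isinstance(converted, str) or (text and not converted):
        return text
    return converted
```

If the `hanziconv` package is installed, `HanziConv.toSimplified` could be passed as `convert`; whether this would resolve the failure in this issue is untested.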

You might want to check if the document opens and displays correctly in other tools, try converting it to UTF-8 encoding, or test with a simplified Chinese version to isolate the problem. If the issue persists only with traditional Chinese, it likely stems from HanziConv's handling, which RAGFlow does not currently validate or guard against.
Reference: tokenizer and parser implementation
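
For the encoding check suggested above, a minimal sketch that tries a few candidate encodings on the raw bytes and reports which one decoded cleanly. It only applies to plain-text files, not to binary formats such as PDF or DOCX. The helper name `decode_with_fallback` and the candidate list are assumptions for illustration, not RAGFlow code.

```python
from typing import Optional, Sequence, Tuple

# Assumed candidates: UTF-8 first, then encodings common for Chinese text.
CANDIDATE_ENCODINGS = ("utf-8", "big5", "gb18030")


def decode_with_fallback(
    raw: bytes, encodings: Sequence[str] = CANDIDATE_ENCODINGS
) -> Tuple[str, Optional[str]]:
    """Return (text, encoding) for the first encoding that decodes `raw`.

    If none succeeds, decode as UTF-8 with replacement characters and
    return None as the encoding so the caller can flag the file.
    """
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    return raw.decode("utf-8", errors="replace"), None


def to_utf8_bytes(raw: bytes) -> bytes:
    """Re-encode the decoded text as UTF-8 before uploading."""
    text, _ = decode_with_fallback(raw)
    return text.encode("utf-8")
```

The order of candidates matters: bytes that are valid in more than one legacy encoding are decoded with whichever comes first, so the detected encoding is a hint rather than a guarantee.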


Jun 10 '25 11:06 dosubot[bot]

I’ve also uploaded documents in Traditional Chinese without encountering the issue you described. Could you let me know what file type you were uploading—was it a PDF or PPT?

Jun 12 '25 02:06 midilandlu