ragflow [Question]: When using DeepDOC to parse a Chinese PDF, the numbers in the parsed chunks are out of order

Describe your problem

Create a knowledge base

Upload a Chinese PDF and configure the parsing method

Click the Execute button and wait for parsing to finish

Check the generated chunks

As the results above show, the order of the numbers has been scrambled. I tried different embedding models and OCR/layout recognition methods, including DeepDOC and Plain Text, but the problem was not solved. How can I ensure that the chunks are in the same order as the original document?

Feb 12 '25 03:02 xyk0930

Could you share this file here?

Feb 13 '25 02:02 KevinHuSh

中国高血压防治指南(2024年修订版).pdf

I have uploaded the original file. Using the General parsing method with Plain Text layout recognition fixes the ordering problem, but the resulting chunks do not work very well.

Feb 13 '25 02:02 xyk0930

@KevinHuSh Hi, could you take a look at this when you have a moment?

Feb 18 '25 02:02 xyk0930

I've tested it. The current model has flaws and does not recognize this kind of file correctly.

Feb 18 '25 05:02 KevinHuSh

I've tested it. The current model has flaws and does not recognize this kind of file correctly.

Has this bug been fixed? If not, is the best approach for now to use another OCR tool to convert the PDF to plain text first, and then use that text as input?
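For what it's worth, here is a minimal sketch of what that workaround might look like. It assumes an external OCR tool returns text blocks with a page number and a position on the page. The `Block` class, its fields, and both function names are hypothetical and are not part of RAGFlow or any specific OCR library. It also assumes a single-column layout; a multi-column page would need column detection first.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical text block, as an external OCR tool might report it."""
    page: int    # 1-based page number
    top: float   # distance from the top of the page
    left: float  # distance from the left edge of the page
    text: str


def in_reading_order(blocks, line_tolerance=3.0):
    """Order blocks by page, then top to bottom, then left to right.

    Blocks on the same page whose `top` is within `line_tolerance` of the
    first block of a line are treated as part of that line, so small OCR
    jitter does not swap neighbours. Assumes a single-column layout.
    """
    ordered = sorted(blocks, key=lambda b: (b.page, b.top))
    lines = []
    for block in ordered:
        last = lines[-1] if lines else None
        same_line = (
            last is not None
            and last[0].page == block.page
            and block.top - last[0].top <= line_tolerance
        )
        if same_line:
            last.append(block)
        else:
            lines.append([block])
    # Within each detected line, read from left to right.
    return [b for line in lines for b in sorted(line, key=lambda b: b.left)]


def to_plain_text(blocks):
    """Join the ordered blocks into plain text, one block per line."""
    return "\n".join(b.text for b in in_reading_order(blocks))


if __name__ == "__main__":
    scrambled = [
        Block(1, 100.0, 50.0, "2 Diagnosis"),
        Block(2, 10.0, 50.0, "3 Treatment"),
        Block(1, 20.0, 50.0, "1 Overview"),
    ]
    print(to_plain_text(scrambled))
```

The resulting text file could then be uploaded in place of the PDF, so that chunking only has to split plain text and no longer depends on layout recognition.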

Apr 10 '25 07:04 caolonghao