[Question]: When using DeepDOC to parse a Chinese PDF, the numbers in the parsed chunks are out of order
Describe your problem
- Create a knowledge base
- Upload a pdf in Chinese and configure the parsing method
- Click the Execute button and wait for it to finish
- Check the generated chunks
As you can see from the above results, the sequence of numbers has been scrambled. I tried different embedding models and OCR methods, including DeepDOC and Plain Text, but none of them solved the problem. How can I ensure that the chunks are in the same order as in the original document?
Could you share this file here?
I have uploaded the original file. I used the General parsing method with Plain Text layout recognition to work around this problem, but the resulting chunks do not work very well.
@KevinHuSh Hi, can you answer that when you have a moment?
I've tested it. The current model has flaws and cannot correctly recognize this kind of file.
Has this bug been fixed? If not, is the best approach for now to use another OCR tool to convert the PDF to plain text first, and then use that text as input?
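For anyone trying the plain-text workaround asked about above, here is a minimal sketch of the reordering step. It is not DeepDOC's or RAGFlow's code. It assumes an external OCR or PDF tool has already produced text blocks with a page number and a top-left position; `TextBlock`, `sort_reading_order`, `blocks_to_text`, and `y_tolerance` are hypothetical names used only for illustration. It also assumes a single-column layout with the y coordinate growing downward, so it may not be enough for multi-column pages.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextBlock:
    """Hypothetical container for one OCR result (not a RAGFlow type)."""
    page: int    # zero-based page index
    x0: float    # left edge of the block
    y0: float    # top edge of the block (assumed to grow downward)
    text: str


def sort_reading_order(blocks: List[TextBlock], y_tolerance: float = 3.0) -> List[TextBlock]:
    """Order blocks page by page, top to bottom, then left to right.

    Blocks whose top edges are within y_tolerance of the first block
    in the current row are treated as being on the same line.
    """
    ordered: List[TextBlock] = []
    row: List[TextBlock] = []
    for block in sorted(blocks, key=lambda b: (b.page, b.y0, b.x0)):
        new_page = bool(row) and block.page != row[0].page
        new_line = bool(row) and block.y0 - row[0].y0 > y_tolerance
        if new_page or new_line:
            # Flush the finished row, left to right.
            ordered.extend(sorted(row, key=lambda b: b.x0))
            row = []
        row.append(block)
    ordered.extend(sorted(row, key=lambda b: b.x0))
    return ordered


def blocks_to_text(blocks: List[TextBlock], y_tolerance: float = 3.0) -> str:
    """Join the reordered blocks into plain text, one block per line."""
    return "\n".join(b.text for b in sort_reading_order(blocks, y_tolerance))
```

The resulting text could then be saved as a .txt file and uploaded in place of the PDF. Whether this gives better chunks than the built-in parsers depends on the quality of the external OCR output.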