How can I extract text from PDFs with Chinese characters?
🚀 The feature, motivation and pitch
Hi, thanks for this awesome project. I'd like to process PDFs written in Chinese. What should I do?
In my case, it works. The Chinese characters are extracted as Unicode escape sequences, something like "\u5230\u9632\u5c18\u3001", and then you can decode them.
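For anyone wondering what "decode" looks like in practice, here is a minimal sketch. It assumes the output is JSON and that the extracted text sits in a field called `text`; both the field name and the sample record are hypothetical, so adjust them to whatever your output actually contains. Any JSON parser turns the `\uXXXX` escapes back into real characters.

```python
import json

# Hypothetical one-line JSON record; the "text" field name is an assumption.
# The escaped string is the sample quoted in the comment above.
raw_line = r'{"text": "\u5230\u9632\u5c18\u3001"}'

def decode_record(line: str) -> str:
    """Parse one JSON record; the parser converts \\uXXXX escapes to characters."""
    return json.loads(line)["text"]

decoded = decode_record(raw_line)
print(decoded)

# When writing the result back out, ensure_ascii=False keeps the characters
# readable instead of escaping them again.
readable = json.dumps({"text": decoded}, ensure_ascii=False)
print(readable)
```

If the escapes show up in a plain string rather than in JSON, the same idea applies, but the parsing step will differ.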
Hi @oldunclez, thanks for your help. The official intro says "The current model was fine-tuned on English documents; other languages are not likely to work." Does that mean the accuracy on non-English languages such as Chinese might not be good enough?
The online demo at https://olmocr.allenai.org/ lets you try ten pages of your file, so you may want to upload your own file and see for yourself how well it works.