What can I do to support extracting PDFs with Chinese characters?

Open videni opened this issue 10 months ago • 3 comments

🚀 The feature, motivation and pitch

Hi, thanks for this awesome project. I'd like to process PDFs in Chinese. What should I do?

Alternatives

No response

Additional context

No response

videni avatar Mar 04 '25 21:03 videni

In my case, it works. The Chinese characters are extracted as Unicode escape sequences, something like "\u5230\u9632\u5c18\u3001", and then you can decode them.
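
As a minimal sketch of that decoding step, assuming the output is a JSON line and the text sits under a hypothetical `text` key (the key name is an illustration, not something confirmed in this thread), Python's standard `json` module turns the `\uXXXX` escapes back into characters when it parses the line:

```python
import json

# Hypothetical output line; the "text" key is an assumption for illustration.
# The doubled backslashes keep the escapes literal, as they would appear in a file.
line = '{"text": "\\u5230\\u9632\\u5c18\\u3001"}'

# json.loads decodes \uXXXX escapes into real Unicode characters.
record = json.loads(line)
print(record["text"])

# To write the record back out with readable characters instead of escapes,
# pass ensure_ascii=False.
readable = json.dumps(record, ensure_ascii=False)
print(readable)
```

The escapes are only how JSON serializes non-ASCII text, so nothing is lost; any JSON parser should recover the original characters.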

oldunclez avatar Mar 05 '25 03:03 oldunclez

Hi, @oldunclez, thanks for your help. The official intro says "The current model was fine-tuned on English documents; other languages are not likely to work." Does it mean the accuracy of a non-English language such as Chinese might not be good enough?

videni avatar Mar 05 '25 03:03 videni

The online demo at https://olmocr.allenai.org/ lets you try ten pages of your file, so you may want to upload your own file and see for yourself how well it works.

Itachired avatar Apr 14 '25 01:04 Itachired