PaddleOCR
Romanian Corpus and Character set
The Romanian corpus file is a cleaned version of the official Romanian Scrabble word list (https://dexonline.ro/scrabble), licensed under GPL (https://dexonline.ro/licenta). In addition to the base form of each word, it contains the inflected forms and the diacritic-free forms (diacritics are mostly not used online). Please let me know if the corpus should be simplified. I'm not sure if there's anything else I should add. Here's the Wikipedia page about the Romanian language: https://en.wikipedia.org/wiki/Romanian_language.
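For reviewers who want to sanity-check the corpus against the character set, here is a minimal sketch of how diacritic-free forms and a character list could be derived from a word list. It is not the script used to build the files in this PR. The function names (`strip_diacritics`, `expand_corpus`, `character_set`) are hypothetical, and it assumes the dictionary is simply the sorted set of characters that occur in the corpus.

```python
import unicodedata


def strip_diacritics(word: str) -> str:
    """Return the word with combining marks removed (e.g. ă -> a, ș -> s, ț -> t)."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def expand_corpus(words):
    """Return each word followed by its diacritic-free form, skipping duplicates."""
    seen = set()
    expanded = []
    for word in words:
        for form in (word, strip_diacritics(word)):
            if form not in seen:
                seen.add(form)
                expanded.append(form)
    return expanded


def character_set(words):
    """Return the sorted list of distinct characters used in the words."""
    return sorted({ch for word in words for ch in word})


if __name__ == "__main__":
    sample = ["țară", "și", "om"]  # illustrative words, not taken from the corpus file
    corpus = expand_corpus(sample)
    print(corpus)
    print(character_set(corpus))
```

Normalizing to NFD and dropping combining marks also handles the legacy cedilla variants (ş, ţ) without a hand-written mapping table.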
This is the same as https://github.com/PaddlePaddle/PaddleOCR/pull/5881, which I closed because that pull request was opened with a different email address, which prevented me from signing the CLA.
Thanks for your contribution!
Please provide some feedback on what else needs to be done to merge the Romanian corpus and character set. I see that the Vietnamese PR (#7933) is in limbo as well. Clearly, the steps outlined in the Multilingual OCR Development Plan (#1048) are not enough.