
Romanian Corpus and Character set

Open the-ge opened this issue 2 years ago • 2 comments

The Romanian corpus file is a cleaned version of the official Romanian Scrabble word list (https://dexonline.ro/scrabble), licensed under the GPL (https://dexonline.ro/licenta). In addition to the base form of each word, it contains the inflected forms and the diacriticless forms (diacritics are mostly not used online). Please let me know if the corpus should be simplified. I'm not sure if there's anything else I should add. Here's the Wikipedia page about the Romanian language: https://en.wikipedia.org/wiki/Romanian_language.
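For illustration, here is a minimal sketch of how diacriticless variants and a character set could be derived from a word list like this one. The function names (`strip_diacritics`, `expand_corpus`, `character_set`) are hypothetical, and this is not necessarily how the files in this PR were produced. It assumes one word per entry and maps both the comma-below letters (ș, ț) and the legacy cedilla letters (ş, ţ) to their plain counterparts.

```python
import unicodedata

# Romanian letters with diacritics mapped to plain ASCII letters.
# Assumption: both comma-below (ș, ț) and legacy cedilla (ş, ţ) forms may occur.
DIACRITIC_MAP = str.maketrans({
    "ă": "a", "â": "a", "î": "i", "ș": "s", "ț": "t", "ş": "s", "ţ": "t",
    "Ă": "A", "Â": "A", "Î": "I", "Ș": "S", "Ț": "T", "Ş": "S", "Ţ": "T",
})


def strip_diacritics(word):
    """Return the word with Romanian diacritics replaced by plain letters."""
    # Normalize to NFC first so decomposed letters match the mapping keys.
    return unicodedata.normalize("NFC", word).translate(DIACRITIC_MAP)


def expand_corpus(words):
    """Return the sorted, de-duplicated words plus their diacriticless forms."""
    expanded = set()
    for raw in words:
        word = unicodedata.normalize("NFC", raw.strip())
        if not word:
            continue  # skip blank lines
        expanded.add(word)
        expanded.add(strip_diacritics(word))
    return sorted(expanded)


def character_set(words):
    """Return the sorted list of distinct characters used in the words."""
    return sorted({ch for word in words for ch in word})


corpus = expand_corpus(["țară", "mână", ""])
charset = character_set(corpus)
```

Each word is kept alongside its diacriticless form, so the character set built from the expanded corpus covers both the accented and the plain letters.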

This is the same as https://github.com/PaddlePaddle/PaddleOCR/pull/5881, which I closed because that pull request was made with a different email address, which prevented me from signing the CLA.

the-ge, Mar 15 '23 19:03

Thanks for your contribution!

paddle-bot[bot], Mar 15 '23 19:03

Please provide some feedback on what else needs to be done to merge the Romanian corpus and character set. I see that the Vietnamese PR (#7933) is in limbo as well. Clearly, the steps outlined in the Multilingual OCR Development Plan (#1048) are not enough.

the-ge, Apr 13 '23 21:04