PaddleOCR Multilingual OCR Development Plan

| model name | description | model size | download | Update Date |
| --- | --- | --- | --- | --- |
| ch | Chinese and English | 3.71M | inference model / trained model | 2020.9.22 |
| ch_tra | Traditional Chinese | 5.63M | inference model / trained model | 2021.1.21 |
| en | English | 2.56M | inference model / trained model | 2020.9.22 |
| fr | French | 2.65M | inference model / trained model | 2021.9.22 |
| ar | Arabic | 2.53M | inference model / trained model | 2021.1.21 |
| es | Spanish | 2.53M | inference model / trained model | 2021.1.21 |
| pt | Portuguese | 2.63M | inference model / trained model | 2021.1.21 |
| ru | Russian | 2.63M | inference model / trained model | 2021.1.21 |
| ge | German | 2.65M | inference model / trained model | 2020.9.22 |
| kr | Korean | 3.9M | inference model / trained model | 2020.9.22 |
| jp | Japanese | 4.23M | inference model / trained model | 2020.9.22 |
| it | Italian | 2.53M | inference model / trained model | 2021.1.21 |
| hi | Hindi | 2.63M | inference model / trained model | 2021.1.21 |
| ug | Uyghur | 2.63M | inference model / trained model | 2021.1.21 |
| fa | Persian | 2.63M | inference model / trained model | 2021.1.21 |
| ur | Urdu | 2.63M | inference model / trained model | 2021.1.21 |
| oc | Occitan | 2.53M | inference model / trained model | 2021.1.21 |
| mr | Marathi | 2.63M | inference model / trained model | 2021.1.21 |
| ne | Nepali | 2.63M | inference model / trained model | 2021.1.21 |
| rs_cyrillic | Serbian (Cyrillic) | 2.63M | inference model / trained model | 2021.1.21 |
| rs_latin | Serbian (Latin) | 2.53M | inference model / trained model | 2021.1.21 |
| bg | Bulgarian | 2.63M | inference model / trained model | 2021.1.21 |
| uk | Ukrainian | 2.63M | inference model / trained model | 2021.1.21 |
| be | Belarusian | 2.63M | inference model / trained model | 2021.1.21 |
| te | Telugu | 2.63M | inference model / trained model | 2021.1.21 |
| kn | Kannada | 2.63M | inference model / trained model | 2021.1.21 |
| ta | Tamil | 2.63M | inference model / trained model | 2021.1.21 |
| mg | Mongolian | -- | Ongoing | |
| bg | Bangla | -- | Need dict and corpus | |
| vi | Vietnamese | -- | Ongoing | |
| bm | Burmese | -- | Need dict and corpus | |
| tr | Turkish | -- | Need corpus | |
| po | Polish | -- | Need dict and corpus | |
| | More | | TBC | |

Guideline for new language requests

If you want to request support for a new language, a PR with the following two files is needed:

1. In the folder ppocr/utils/dict, submit the dictionary file, named {language}_dict.txt, containing a list of all characters. Please see the other files in that folder for a format example.
2. In the folder ppocr/utils/corpus, submit the corpus file, named {language}_corpus.txt, containing a list of words in your language. At least roughly 50,000 words per language are probably needed; the more, the better.

If your language has unique elements, please let me know in advance by any means, for example with useful links or Wikipedia pages.
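
As a rough illustration of preparing the two files, the sketch below derives a character list from a word list and writes both files under the paths named in the guideline. The helper names (`build_dict`, `write_language_files`) are hypothetical, and the one-entry-per-line layout is an assumption; check the existing files in ppocr/utils/dict for the actual format before opening a PR.

```python
from pathlib import Path

MIN_WORDS = 50_000  # rough minimum mentioned in the guideline


def build_dict(words):
    """Return the sorted unique non-whitespace characters found in `words`."""
    return sorted({ch for word in words for ch in word if not ch.isspace()})


def write_language_files(language, words, root="."):
    """Write {language}_dict.txt and {language}_corpus.txt under `root`.

    Assumes one character per line (dict) and one word per line (corpus).
    Returns (dict_path, corpus_path, has_enough_words).
    """
    words = [w.strip() for w in words if w.strip()]  # drop empty entries
    root = Path(root)
    dict_path = root / "ppocr" / "utils" / "dict" / f"{language}_dict.txt"
    corpus_path = root / "ppocr" / "utils" / "corpus" / f"{language}_corpus.txt"
    for path in (dict_path, corpus_path):
        path.parent.mkdir(parents=True, exist_ok=True)
    dict_path.write_text("\n".join(build_dict(words)) + "\n", encoding="utf-8")
    corpus_path.write_text("\n".join(words) + "\n", encoding="utf-8")
    return dict_path, corpus_path, len(words) >= MIN_WORDS
```

Calling `write_language_files("xx", my_words)` from the repository root would create both files; the third return value only flags whether the word count reaches the suggested minimum.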

Oct 28 '20 15:10 D-DanielYang

Traditional Mongolian

Nov 02 '20 03:11 saheya

I would love to work on "Bangla"

Nov 08 '20 07:11 omar16100

I would be very happy if you did that for Vietnamese.

Nov 10 '20 08:11 levanpon98

How about Arabic? That would be great.

Nov 10 '20 22:11 HusseinYoussef

I've found that the PaddleOCR algorithm cannot recognize some special characters (such as the comma, semicolon, or period) when the language is English. Is there any way I can fix this problem?

Nov 18 '20 08:11 Hieung28

I would like to contribute by adding the Burmese language. Is it only necessary to submit two text files (dict and corpus)? What further steps do we need to take?

Nov 27 '20 22:11 GmGniap

Adding "Bangla" would be great for the people of South Asia.

Nov 28 '20 02:11 xeron56

Adding "Traditional Chinese (zh-TW)" would be great support.

Dec 07 '20 02:12 giranntu

Do you have a pretrained Russian recognition model?

Dec 07 '20 10:12 Ru-Van

Hi, adding the "Tamil" language would be much appreciated.

Tamil_dict.txt Tamil_corpus.txt

If you need more help, please refer to this issue: https://github.com/JaidedAI/EasyOCR/issues/39

Dec 21 '20 16:12 SasiAravind

I can help with Turkish language.

Dec 24 '20 07:12 fcakyon

I can help with the Polish language.

Jan 03 '21 20:01 krzynio

@GmGniap Hello, can you provide the corpus file for the Burmese language?

Jan 26 '21 05:01 xmy0916

@shahidul56 Hello, can you provide the corpus file for the Bangla language?

Jan 26 '21 06:01 xmy0916

None of the models updated on 2021.1.21 can be downloaded; they fail with the following error: { code: "NoSuchKey", message: "The specified key does not exist.", requestId: "aa1bfeff-f572-40aa-8935-6129b1533ed1" }

Jan 26 '21 10:01 azmat21

> None of the models updated on 2021.1.21 can be downloaded; they fail with the following error: { code: "NoSuchKey", message: "The specified key does not exist.", requestId: "aa1bfeff-f572-40aa-8935-6129b1533ed1" }

Sorry for the invalid links. All of them have been fixed now, so please try again.

Jan 27 '21 08:01 D-DanielYang

> I would be very happy if you did that for Vietnamese.

#1847, seems to be ongoing.

Jan 27 '21 19:01 redcinelli

@redcinelli Thank you very much. The Vietnamese model is in training and will be available soon~

Jan 28 '21 06:01 xmy0916

| model name | description | model size | download | Update Date |
| --- | --- | --- | --- | --- |
| ch | Chinese and English | 3.71M | inference model / trained model | 2020.9.22 |
| cht | Traditional Chinese | 5.63M | inference model / trained model | 2021.1.21 |
| en | English | 2.56M | inference model / trained model | 2020.9.22 |
| fr | French | 2.65M | inference model / trained model | 2021.9.22 |
| ar | Arabic | 2.53M | inference model / trained model | 2021.1.21 |
| xi | Spanish | 2.53M | inference model / trained model | 2021.1.21 |
| pu | Portuguese | 2.63M | inference model / trained model | 2021.1.21 |
| ru | Russian | 2.63M | inference model / trained model | 2021.1.21 |
| ge | German | 2.65M | inference model / trained model | 2020.9.22 |
| kr | Korean | 3.9M | inference model / trained model | 2020.9.22 |
| jp | Japanese | 4.23M | inference model / trained model | 2020.9.22 |
| it | Italian | 2.53M | inference model / trained model | 2021.1.21 |
| hi | Hindi | 2.63M | inference model / trained model | 2021.1.21 |
| ug | Uyghur | 2.63M | inference model / trained model | 2021.1.21 |
| fa | Persian | 2.63M | inference model / trained model | 2021.1.21 |
| ur | Urdu | 2.63M | inference model / trained model | 2021.1.21 |
| rs | Serbian (Latin) | 2.53M | inference model / trained model | 2021.1.21 |
| oc | Occitan | 2.53M | inference model / trained model | 2021.1.21 |
| mr | Marathi | 2.63M | inference model / trained model | 2021.1.21 |
| ne | Nepali | 2.63M | inference model / trained model | 2021.1.21 |
| rsc | Serbian (Cyrillic) | 2.63M | inference model / trained model | 2021.1.21 |
| bg | Bulgarian | 2.63M | inference model / trained model | 2021.1.21 |
| uk | Ukrainian | 2.63M | inference model / trained model | 2021.1.21 |
| be | Belarusian | 2.63M | inference model / trained model | 2021.1.21 |
| te | Telugu | 2.63M | inference model / trained model | 2021.1.21 |
| ka | Kannada | 2.63M | inference model / trained model | 2021.1.21 |
| ta | Tamil | 2.63M | inference model / trained model | 2021.1.21 |
| mg | Mongolian | -- | Ongoing | |
| bg | Bangla | -- | Need dict and corpus | |
| vi | Vietnamese | -- | Need dict and corpus | |
| bm | Burmese | -- | Need dict and corpus | |
| tk | Turkish | -- | Need dict and corpus | |
| po | Polish | -- | Need dict and corpus | |
| | More | | TBC | |

@grasswolfs The model name for Turkish should be "tr" instead of "tk"; "tr" is the widely used abbreviation for Turkish.

Jan 28 '21 07:01 fcakyon

I have also opened a PR for the Turkish dict and corpora: https://github.com/PaddlePaddle/PaddleOCR/pull/1856

Jan 28 '21 07:01 fcakyon

Thanks @habout632 for adding Southeast Asian languages via #1896

Feb 02 '21 03:02 tink2123

Here is a dictionary for Greek. el_dict.txt

Mar 16 '21 14:03 yumeliu

Hi, do we have a model that detects all English characters along with special characters like . , " ( )?

Mar 16 '21 17:03 alenma04

Hi, thank you for the great work! I just wonder whether you will add Traditional Chinese to the general model? Right now, the general model supports Simplified Chinese, English, and numbers.

May 12 '21 08:05 Jane-Ding

Hi, can we use line data longer than 50 characters (max_char_length) for training? After training the rec model with a 25-character length as well as a 50-character length, I found that the 25-character data gives lower loss and good accuracy, while the 50-character data gives higher loss and lower accuracy. Please find sample Devanagari data below.

train_img/0022_BindiyaKiAathmakatha_Img_300_Org_Page_0001_crop_9.jpg बीत गया । असमय के इस बुढ़ापे की देहली पर बैठी, मौत की
train_img/0022_BindiyaKiAathmakatha_Img_300_Org_Page_0001_crop_10.jpg प्रतीक्षा कर रही हूँ । पर लगाता है उसने भी सबों के साथ-साथ
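
For label lines of the form shown above (an image path followed by the text), a quick way to see how many samples exceed a given length limit is sketched below. The function names are hypothetical, the separator is assumed to be the first run of whitespace, and `len` counts Unicode code points, which for Devanagari can be more than the number of visible characters.

```python
def text_lengths(label_lines):
    """Yield (image_path, text_length) for each 'path<whitespace>text' line."""
    for line in label_lines:
        parts = line.strip().split(None, 1)  # split on the first whitespace run
        if not parts:
            continue  # skip blank lines
        text = parts[1] if len(parts) > 1 else ""
        yield parts[0], len(text)  # length in code points, not glyphs


def over_limit(label_lines, max_len=25):
    """Return the (image_path, text_length) pairs longer than `max_len`."""
    return [(path, n) for path, n in text_lengths(label_lines) if n > max_len]
```

Running `over_limit(open("train_list.txt", encoding="utf-8"), max_len=50)` on a label file (the file name here is only an example) lists the lines that would not fit a 50-character limit.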

May 15 '21 06:05 JITESH11989

After downloading the inference and trained models, how can I use them? Can anyone point me to some resources or testing/evaluation code that uses these models?

Thanks

Jun 24 '21 13:06 MANISH007700

Is there a plan to develop a unified model that supports recognizing images containing mixed multilingual text? Thanks.

Jun 29 '21 08:06 wuye9036

Traditional Mongolian 👀

Jul 23 '21 16:07 ESWZY

Hi, thank you for the great work! I am sending you a dictionary for Vietnamese; the file is attached below: vietnamese_dict.txt. This file comes from this research: https://github.com/VinAIResearch/dict-guided. You can evaluate on the VinText dataset, a scene text detection dataset for Vietnamese, which can be downloaded from the same GitHub repository. Thank you.

@inproceedings{m_Nguyen-etal-CVPR21,
      author = {Nguyen Nguyen and Thu Nguyen and Vinh Tran and Triet Tran and Thanh Ngo and Thien Nguyen and Minh Hoai},
      title = {Dictionary-guided Scene Text Recognition},
      year = {2021},
      booktitle = {Proceedings of the {IEEE} Conference on Computer Vision and Pattern Recognition (CVPR)},
    }

Aug 11 '21 13:08 thongvhoang

Please add the Bangla language. Here are the dict and corpus:

dict corpus

Aug 16 '21 20:08 dynamicguy
