tessdata_fast
kur_ara does not have Arabic unicharset.
Please see details at
https://github.com/tesseract-ocr/tessdata/pull/88#issuecomment-375644263
https://github.com/tesseract-ocr/langdata/issues/116
https://github.com/tesseract-ocr/tessdata_best/issues/23
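For anyone who wants to check the claim on their own copy: a minimal sketch, assuming the unicharset has already been extracted from the traineddata as a text file whose first line is the entry count and whose later lines each begin with the unichar, with `NULL` standing in for the space entry. The helper name `script_counts` is hypothetical, and the script is guessed from Unicode character names rather than from the unicharset's own fields.

```python
import unicodedata
from collections import Counter


def script_counts(unicharset_text):
    """Count the characters found in unicharset entries by coarse script.

    Assumes (not verified against every Tesseract version) that line 1 is
    the entry count and every later line starts with the unichar itself.
    """
    counts = Counter()
    for line in unicharset_text.splitlines()[1:]:
        if not line.strip():
            continue
        unichar = line.split(None, 1)[0]
        if unichar == "NULL":  # assumed placeholder entry for the space character
            continue
        for ch in unichar:
            name = unicodedata.name(ch, "")
            if name.startswith("ARABIC"):
                counts["Arabic"] += 1
            elif name.startswith("LATIN"):
                counts["Latin"] += 1
            else:
                counts["Other"] += 1
    return counts


if __name__ == "__main__":
    # Made-up three-entry sample with simplified property fields.
    sample = "3\nNULL 0 Common 0\na 3 Latin 1\n\u0628 1 Arabic 2\n"
    print(script_counts(sample))
```

An Arabic count of zero on the real file would be consistent with the report above.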
@jbreiden @AlexanderP FYI: there is a problem with the packaged traineddata for kur_ara.
@Shreeshrii Did I understand correctly? Does the traineddata need to be changed in the packages?
tesseract-ocr-kur-ara -> tesseract-ocr-kur
tesseract-ocr-kur -> tesseract-ocr-kur-ara
There is no traineddata for kur in tessdata_fast.
I will unpack and convert the dawgs to word list and see if it is possible to correct kur_ara files.
Please do not make any change yet.
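A rough sketch of the unpack-and-convert step, assuming the `combine_tessdata` and `dawg2wordlist` tools from Tesseract's training tools are on the PATH and that the unpacked LSTM components are named `<lang>.lstm-unicharset` and `<lang>.lstm-word-dawg`. The helper names, the output directory, and the `wordlist.txt` suffix are illustrative choices, not anything taken from this thread.

```python
import subprocess
from pathlib import Path


def unpack_commands(traineddata, out_dir):
    """Build the command lines to unpack a traineddata file and dump its word dawg."""
    lang = Path(traineddata).stem            # "kur_ara" for kur_ara.traineddata
    prefix = str(Path(out_dir) / lang) + "."  # combine_tessdata -u expects a path prefix
    unpack = ["combine_tessdata", "-u", str(traineddata), prefix]
    to_wordlist = [
        "dawg2wordlist",
        prefix + "lstm-unicharset",  # assumed component name
        prefix + "lstm-word-dawg",   # assumed component name
        prefix + "wordlist.txt",     # hypothetical output file
    ]
    return [unpack, to_wordlist]


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    for cmd in unpack_commands("kur_ara.traineddata", "unpacked"):
        print(" ".join(cmd))
```

Printing the commands first makes it easy to confirm the file names against what `combine_tessdata -u` actually writes before calling `run_all`.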
@AlexanderP
tesseract-ocr-kur-ara -> tesseract-ocr-kur
Yes, the above change can be made. Currently kur_ara has Latin text only.
tesseract-ocr-kur -> tesseract-ocr-kur-ara
This cannot be done since there is no kur traineddata in tessdata_fast.
@jbreiden @theraysmith
Should I build kur_ara from ara.traineddata, e.g. by replacing the wordlist?
Or is there an updated set of Arabic-script traineddata files that can be uploaded before the 4.0.0 release?
ref: https://github.com/tesseract-ocr/langdata/issues/83#issuecomment-320818183
I was going to push until I discovered a bug with the RTL word lists. Then I also need to integrate this issues list, that I haven't looked at in a while, and rerun training.
Maybe it should be 'kur_lat'.
There is no traineddata for kur in tessdata_fast. I will unpack and convert the dawgs to word list and see if it is possible to correct kur_ara files. Please do not make any change yet.
ok
Was this issue solved by the renaming?
kmr is Kurdish in Latin script. Renaming has fixed that issue.
kur was Kurdish in Arabic script in Tesseract3. We have still not restored kur or kur_ara.
So you suggest to restore https://github.com/tesseract-ocr/tessdata/blob/3.04.00/kur.traineddata to the master branch of tessdata?
That will not work, since Arabic script in 3.04 relied on the cube engine, which is no longer in the codebase.
I will try to recreate it by fine-tuning, using the wordlist and training text.
https://github.com/Shreeshrii/tesstrain-ckb
ckb is the preferred prefix, rather than kur_ara.
My fine-tuned traineddata gives improved results compared to the official ara and script/Arabic traineddata on the synthetic eval set.