
kur_ara does not have Arabic unicharset.

Open Shreeshrii opened this issue 7 years ago • 12 comments

Please see details at

https://github.com/tesseract-ocr/tessdata/pull/88#issuecomment-375644263

https://github.com/tesseract-ocr/langdata/issues/116

https://github.com/tesseract-ocr/tessdata_best/issues/23

@jbreiden @alexanderP FYI: this concerns a problem with the packaged traineddata for kur_ara.

Shreeshrii avatar Mar 23 '18 12:03 Shreeshrii

@Shreeshrii Did I understand correctly? Does the traineddata need to be changed in the packages as follows?

tesseract-ocr-kur-ara -> tesseract-ocr-kur

tesseract-ocr-kur -> tesseract-ocr-kur-ara

AlexanderP avatar Mar 24 '18 06:03 AlexanderP

There is no traineddata for kur in tessdata_fast.

I will unpack the traineddata, convert the dawgs to word lists, and see if it is possible to correct the kur_ara files.

Please do not make any change yet.
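
A minimal sketch of the unpack-and-convert step mentioned above, assuming the standard Tesseract training tools `combine_tessdata` and `dawg2wordlist` are installed. The helper name `unpack_commands`, the output directory, and the component names `lstm-unicharset` and `lstm-word-dawg` (typical for LSTM-only models) are assumptions for illustration and are not taken from this thread. The function only builds the command lines; it does not run them.

```python
def unpack_commands(traineddata: str, out_dir: str = "unpacked") -> list:
    """Build the commands to unpack a traineddata file and dump its word dawg.

    Hypothetical helper: file and directory names are illustrative only.
    """
    # "kur_ara.traineddata" -> "kur_ara"
    filename = traineddata.replace("\\", "/").split("/")[-1]
    lang = filename[: -len(".traineddata")] if filename.endswith(".traineddata") else filename

    # combine_tessdata -u writes components as <prefix><component-name>
    prefix = f"{out_dir}/{lang}."

    return [
        # Unpack all components of the traineddata file.
        ["combine_tessdata", "-u", traineddata, prefix],
        # Convert the word dawg back to a plain-text word list.
        # Assumes the model ships LSTM components with these names.
        [
            "dawg2wordlist",
            f"{prefix}lstm-unicharset",
            f"{prefix}lstm-word-dawg",
            f"{prefix}wordlist.txt",
        ],
    ]


for cmd in unpack_commands("kur_ara.traineddata"):
    print(" ".join(cmd))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` once the output directory exists.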


Shreeshrii avatar Mar 24 '18 10:03 Shreeshrii

@AlexanderP

tesseract-ocr-kur-ara -> tesseract-ocr-kur

Yes, the above change can be made. Currently kur_ara has Latin text only.

tesseract-ocr-kur -> tesseract-ocr-kur-ara

This cannot be done since there is no kur traineddata in tessdata_fast.
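
The observation that kur_ara currently has Latin text only can be checked mechanically. The sketch below counts letters by script in a list of entries (for example, the lines of an extracted word list). It uses only the Python standard library; the function names `script_of` and `script_counts` are hypothetical, and classifying by Unicode character name is a simplification chosen for illustration.

```python
import unicodedata
from collections import Counter


def script_of(ch: str) -> str:
    """Classify a character as ARABIC, LATIN, or OTHER by its Unicode name."""
    name = unicodedata.name(ch, "")
    for script in ("ARABIC", "LATIN"):
        if name.startswith(script):
            return script
    return "OTHER"


def script_counts(entries) -> Counter:
    """Count alphabetic characters per script across all entries."""
    counts = Counter()
    for entry in entries:
        for ch in entry.strip():
            # Skip digits, punctuation, and whitespace.
            if ch.isalpha():
                counts[script_of(ch)] += 1
    return counts


# One Latin-script and one Arabic-script sample word.
sample = ["kurd\u00ee", "\u06a9\u0648\u0631\u062f\u06cc"]
print(script_counts(sample))
```

A word list whose counts show no `ARABIC` letters at all would be consistent with the problem reported here.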

Shreeshrii avatar Mar 24 '18 11:03 Shreeshrii

@jbreiden @theraysmith

Should I build kur_ara from ara.traineddata, e.g. by replacing the word list?

Or is there an updated set of Arabic script traineddata files that can be uploaded before the 4.0.0 release?

ref: https://github.com/tesseract-ocr/langdata/issues/83#issuecomment-320818183

I was going to push until I discovered a bug with the RTL word lists. Then I also need to integrate this issues list, that I haven't looked at in a while, and rerun training.

Shreeshrii avatar Mar 24 '18 12:03 Shreeshrii

Maybe it should be 'kur_lat'.

amitdo avatar Mar 24 '18 12:03 amitdo

> There is no traineddata for kur in tessdata_fast. I will unpack and convert the dawgs to word list and see if it is possible to correct kur_ara files. Please do not make any change yet.

ok

AlexanderP avatar Mar 25 '18 09:03 AlexanderP

Was this issue solved by the renaming?

stweil avatar Dec 17 '19 18:12 stweil

kmr is Kurdish in Latin script. Renaming has fixed that issue.

kur was Kurdish in Arabic script in Tesseract 3. We have still not restored kur or kur_ara.

Shreeshrii avatar Dec 19 '19 12:12 Shreeshrii

So you suggest restoring https://github.com/tesseract-ocr/tessdata/blob/3.04.00/kur.traineddata to the master branch of tessdata?

stweil avatar Dec 19 '19 14:12 stweil

That will not work, since Arabic script in 3.04 relied on cube, which is no longer in the codebase.


Shreeshrii avatar Dec 19 '19 14:12 Shreeshrii

I will try to recreate it by fine-tuning, using the word list and training text.


Shreeshrii avatar Dec 20 '19 14:12 Shreeshrii

https://github.com/Shreeshrii/tesstrain-ckb

ckb is the preferred prefix rather than kur_ara.

My fine-tuned training gives improved results compared to the official ara and script/Arabic traineddata on the synthetic eval set.
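
For readers unfamiliar with fine-tuning, here is a rough sketch of what such an `lstmtraining` invocation can look like. It is not the exact setup used in the tesstrain-ckb repository: the helper name `finetune_command`, every file path, the choice of `ara` as the base model, and the iteration count are assumptions for illustration. The flags themselves (`--continue_from`, `--old_traineddata`, `--traineddata`, `--train_listfile`, `--model_output`, `--max_iterations`) are existing `lstmtraining` options. The function only builds the command line.

```python
def finetune_command(
    base_lstm: str,
    old_traineddata: str,
    new_traineddata: str,
    train_listfile: str,
    model_output: str,
    max_iterations: int = 10000,
) -> list:
    """Build an lstmtraining command that fine-tunes an existing model.

    Hypothetical helper: all paths and the iteration count are illustrative.
    """
    return [
        "lstmtraining",
        # LSTM network extracted from the base model (assumed to be a
        # float model, since integer "fast" models cannot be trained further).
        "--continue_from", base_lstm,
        # Needed when the new model's character set differs from the base.
        "--old_traineddata", old_traineddata,
        # Starter traineddata built from the new unicharset and word list.
        "--traineddata", new_traineddata,
        # Text file listing the .lstmf training files.
        "--train_listfile", train_listfile,
        # Prefix for checkpoint files written during training.
        "--model_output", model_output,
        "--max_iterations", str(max_iterations),
    ]


cmd = finetune_command(
    base_lstm="ara.lstm",
    old_traineddata="ara.traineddata",
    new_traineddata="ckb/ckb.traineddata",
    train_listfile="ckb/list.train",
    model_output="ckb/checkpoints/ckb",
)
print(" ".join(cmd))
```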


Shreeshrii avatar Jan 29 '20 10:01 Shreeshrii