TextRecognitionDataGenerator

Hindi text renders incorrectly

Open KosukeHao opened this issue 5 years ago • 8 comments

When I use this repo to generate Hindi text images, it runs without errors, but the characters in the output are rendered incorrectly. PIL's ImageFont.truetype with its basic layout only does a simple character-to-glyph mapping, whereas Hindi is a complex script that needs proper text shaping.

The same problem is reported here: https://github.com/python-pillow/Pillow/issues/3191
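To make the mismatch concrete, here is a small standard-library sketch (not taken from the repo) that lists the code points stored for the word "हिन्दी". The constant WORD and the helper describe are names made up for this illustration.

```python
import unicodedata

# The word "हिन्दी", written with explicit escapes so the sequence is unambiguous.
WORD = "\u0939\u093f\u0928\u094d\u0926\u0940"


def describe(text):
    """Return (code point, Unicode name) pairs in logical (storage) order."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


for code_point, name in describe(WORD):
    print(code_point, name)
```

The vowel sign I (U+093F) is stored after HA but is drawn to its left, and NA + VIRAMA + DA is usually drawn as a single conjunct. A renderer that maps each code point to one glyph in storage order cannot do either, which is why a shaping engine is needed.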

KosukeHao, Jun 27 '19 04:06

So it's pretty much like Arabic, where ligatures were not rendered properly.

I'll take a look at that Pillow issue; they seem to have linked to a few libraries that address the problem.

Thank you for reporting this.

Belval, Jun 27 '19 10:06

@Belval @KosukeHao How can I generate Hindi images? It is mentioned that the language we use should be French, English, Spanish, German, or Chinese.

dwivediagam, Mar 04 '20 16:03

There is no solution as of today. Unfortunately, ligatures are still not supported, and the dependencies needed to support them are unknown.

I would be very interested if you can make a PR that has a working sample. I would love to work on it myself, but I do not know Hindi and cannot see the difference between good and bad samples. If you can provide clear examples of what TRDG should output given a specific input I could give it another try.

Belval, Mar 04 '20 18:03

Installing libraqm as suggested here should help in most cases: https://stackoverflow.com/questions/39630916/

Edit:
Sometimes installing libraqm causes the following error:
OSError: invalid face handle
Other times it just works. Not sure what the cause could be.
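A quick way to check whether the installed Pillow build can actually see libraqm is sketched below. It assumes a reasonably recent Pillow (the ImageFont.Layout enum needs 9.1 or later). The helper names raqm_available and load_shaping_font are made up for this illustration, and font_path stands for whatever Devanagari or Gurmukhi font you use. This only diagnoses whether Raqm is usable; it does not explain the "invalid face handle" error.

```python
try:
    from PIL import features
except ImportError:  # Pillow is not installed at all
    features = None


def raqm_available():
    """Return True if Pillow is importable and reports libraqm support."""
    return features is not None and bool(features.check("raqm"))


def load_shaping_font(font_path, size=64):
    """Load a font with the Raqm layout engine, or fail loudly (hypothetical helper)."""
    if not raqm_available():
        raise RuntimeError("Pillow cannot find libraqm; complex scripts will not be shaped")
    from PIL import ImageFont
    return ImageFont.truetype(font_path, size, layout_engine=ImageFont.Layout.RAQM)


print("raqm available:", raqm_available())
```

If this prints False after installing libraqm, Pillow is most likely not picking up the library, and text will still be drawn with the basic layout.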

GokulNC, Nov 04 '20 08:11

Is this issue solved? I would like to use the library to generate images in Indian Punjabi (Gurmukhi), which is similar to Hindi (Devanagari). Please let me know if you need help with the Punjabi language.

sanbroz, Jul 15 '21 05:07

ਚੋਭਾ ਸਾਥਣ ਸਨੋਲੀ ਸ਼ੋਨਫੋਲ ਦਰਦਰਾ_8

As you can see, it adds spaces between characters. The actual text is ਚੋਭਾ ਸਾਥਣ ਸਨੋਲੀ ਸ਼ੋਨਫੋਲ ਦਰਦਰਾ

I used these parameters to generate it: -l pb -c 10 -w 5 -f 64 -dt dicts\pb.txt

sanbroz, Jul 15 '21 07:07

This might help: https://github.com/Belval/TextRecognitionDataGenerator/pull/164#issuecomment-732970029

GokulNC, Jul 15 '21 07:07

Thanks Belval & GokulNC, enabling --word_split and installing libraqm seems to solve the problem.

ਉੱਪੁਰ ਟਾਇਰੀ leaves ਤੱਕ ਮੁਫ਼ਤ_6

Label: ਉੱਪੁਰ ਟਾਇਰੀ leaves ਤੱਕ ਮੁਫ਼ਤ_6
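Why splitting on words helps can be sketched with the standard library alone. The snippet assumes the generator draws each piece of the split text separately, which is an assumption about its internals and not something confirmed in this thread. split_pieces and orphaned_marks are hypothetical helpers, not TRDG functions.

```python
import unicodedata

TEXT = "ਚੋਭਾ ਸਾਥਣ"  # first two words of the earlier Gurmukhi sample


def split_pieces(text, word_split):
    """Split into words (keeping the spaces as pieces) or into single code points."""
    if not word_split:
        return list(text)
    pieces = []
    for index, word in enumerate(text.split(" ")):
        if index > 0:
            pieces.append(" ")
        pieces.append(word)
    return pieces


def orphaned_marks(pieces):
    """Pieces that consist of a lone combining mark, cut off from its base letter."""
    return [p for p in pieces if len(p) == 1 and unicodedata.category(p).startswith("M")]


print("per character:", len(orphaned_marks(split_pieces(TEXT, word_split=False))))
print("per word:     ", len(orphaned_marks(split_pieces(TEXT, word_split=True))))
```

With per-character splitting, vowel signs end up as standalone pieces, so no shaping engine can attach them to their consonants. With per-word splitting every mark stays in the same piece as its base letter.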

sanbroz, Jul 15 '21 09:07