
Hebrew combining diacritic error

Open alvinwmtan opened this issue 2 years ago • 4 comments

Hebrew diacritics sometimes generate incorrect pronunciations: eSpeak-NG tries to read out the letters separately instead.

Reprex (1) פִּיל píl "elephant" -> f'e 'i _:(en)h'i:bru:_:d'agES(he) j'od l'amed (on the online site) vs p'il (expected)

I'm not quite sure why this is happening, because the appropriate rules are included in he_rules (e.g. פִּ -> pi). Interestingly, just פִּ by itself also triggers this error. The issues also seem to arise with some other character–diacritic combinations, although I have not tested comprehensively.

alvinwmtan avatar Mar 31 '22 00:03 alvinwmtan

Reprex (2) כַּבַּאי kabaí "firefighter" -> X'af 'a _:(en)h'i:bru:_:d'agES(he) v'et 'a _:(en)h'i:bru:_:d'agES(he) 'alef? j'od (on the online site) vs kaba'i (expected)

Reprex (3) שָׁמַיִם shamáyim "sky" -> S 'a _:(en)h'i:bru:_:S'Ind0t_(he) m'em 'a j'od 'i m'em (on the online site) vs Sam'ajim (expected)

alvinwmtan avatar Mar 31 '22 02:03 alvinwmtan

I'm not sure why this happens. It might be that this is not a rules issue but instead the diacritics are not parsed correctly in the code.

For better debug output, use -X. It shows you which rules are selected.

    espeak-ng -v he 'פִּיל' -X
    Translate '`'
    Translate 'פִּיל'
     36 פ [f]
     57 פִ [fi]

    Found: 'פ' [fe]
    Translate 'ִ'
     36 ִ [i]

    Translate 'ּ'
    Found: '_he' [h'i:bru:]
    Found: '_ּ' [d'agES]
    Found: 'י' [jod]
    Found: 'ל' ['lamed]
    Translate '`'
    f'e 'i _:(en)h'i:bru:_:d'agES(he) j'od l'amed
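The trace suggests that the hiriq is matched together with the letter and the dagesh is then handled on its own. To check the order of the marks in an input string, a small diagnostic like the following may help. This is only a sketch: the `describe` helper is hypothetical, and the example word is built from explicit escapes on the assumption that the failing input is pe, hiriq, dagesh, yod, lamed.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name, combining class) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"), unicodedata.combining(ch))
        for ch in text
    ]

# Assumed order of the failing input: pe, hiriq, dagesh, yod, lamed.
# Escapes are used so the order of the marks is unambiguous in the source.
word = "\u05E4\u05B4\u05BC\u05D9\u05DC"

for row in describe(word):
    print(row)
```

A non-zero combining class marks a diacritic; two adjacent diacritics in a different order than the rule expects would explain why the rule does not fire.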

jaacoppi avatar Mar 31 '22 03:03 jaacoppi

The reason may be that פִּ is a composite character with zero-width diacritic marks, and it can be written either as U+05BC+U+05B4 or as U+05B4+U+05BC. Both look the same, but the rule exists only for U+05BC+U+05B4 (see attached picture). For Hebrew, similarly to Arabic, every non-canonical character order should be replaced with the canonical order (character + diacritic mark), e.g.

.replace

    פִּ פִּ

Both sides look the same, but the entry replaces U+05B4+U+05BC with U+05BC+U+05B4.
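As an illustration of the same swap outside eSpeak-NG, here is a minimal Python sketch. The `dagesh_first` helper is hypothetical and not part of eSpeak-NG. It also shows that Unicode NFC normalization sorts marks by combining class, which puts hiriq (class 14) before dagesh (class 21), so normalizing the text alone does not produce the order the rule expects.

```python
import unicodedata

PE = "\u05E4"
HIRIQ = "\u05B4"
DAGESH = "\u05BC"

def dagesh_first(text):
    """Swap every hiriq+dagesh pair into dagesh+hiriq order."""
    return text.replace(HIRIQ + DAGESH, DAGESH + HIRIQ)

# Order assumed to be used by the rule: pe, dagesh, hiriq.
rule_order = PE + DAGESH + HIRIQ

# NFC reorders the marks by combining class, so hiriq moves before dagesh.
nfc_order = unicodedata.normalize("NFC", rule_order)
print([f"U+{ord(c):04X}" for c in nfc_order])

# The swap restores the order that the rule matches.
print(dagesh_first(nfc_order) == rule_order)
```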

valdisvi avatar Mar 31 '22 19:03 valdisvi

Yes, that works. After the replace:

    espeak-ng -v he פִּיל -X
    Replace: פִּ > פִּ
    Translate 'פּיל'
     57 פּ [p]
     78 פּי [pi]
     36 פ [f]

     36 ל [l]

    p'il

Any ideas on how to automate this procedure? I only have experience with batch processing ASCII. Doing it by hand will definitely lead to mistakes.
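One possible way to automate it is to generate the `.replace` entries with a script instead of typing them. The sketch below is untested against eSpeak-NG and rests on several assumptions: that every entry has the same `source target` shape as the example above, that all letters U+05D0 to U+05EA and all vowel points U+05B0 to U+05BB need an entry, and that dagesh is the only mark that has to move. The `replace_lines` function is hypothetical.

```python
DAGESH = "\u05BC"

def replace_lines():
    """Build one '.replace' entry per (letter, vowel point) combination."""
    lines = []
    for letter_cp in range(0x05D0, 0x05EA + 1):      # alef .. tav, incl. final forms
        for point_cp in range(0x05B0, 0x05BB + 1):   # sheva .. qubuts
            letter, point = chr(letter_cp), chr(point_cp)
            source = letter + point + DAGESH         # vowel point first, then dagesh
            target = letter + DAGESH + point         # dagesh first, as in the rules
            lines.append(f"    {source} {target}")
    return lines

lines = replace_lines()
print(".replace")
print("\n".join(lines))
```

Many of the generated combinations never occur in real text, so the list could be trimmed to the pairs that he_rules actually defines.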

jaacoppi avatar Apr 01 '22 03:04 jaacoppi

Hebrew is missing a lot of backend code, so Hebrew rules are defaulting to English. We should see what the code for other Semitic languages (am, ar, mt) looks like.

  • espeak-ng-data/lang/sem/he is almost empty.
  • src/libespeak-ng/tr_languages.c has no case L('h', 'e'):

jaacoppi avatar Oct 11 '22 06:10 jaacoppi

Hey, Hebrew just spells out the letters for me, with or without symbols. Is it related? This is with the Android TTS engine.

SolainOG avatar Dec 08 '22 23:12 SolainOG