espeak-ng
Hebrew combining diacritic error
Hebrew diacritics sometimes generate incorrect pronunciations: eSpeak-NG reads out the letters and diacritics separately instead of pronouncing the word.
Reprex (1) פִּיל
píl "elephant"
-> f'e 'i _:(en)h'i:bru:_:d'agES(he) j'od l'amed
(on the online site)
vs p'il
(expected)
I'm not quite sure why this is happening, because the appropriate rules are included in he_rules
(e.g. פִּ
-> pi
). Interestingly, just פִּ
by itself also triggers this error. The issue also seems to arise with some other character–diacritic combinations, although I have not tested this comprehensively.
Reprex (2) כַּבַּאי
kabaí "firefighter"
-> X'af 'a _:(en)h'i:bru:_:d'agES(he) v'et 'a _:(en)h'i:bru:_:d'agES(he) 'alef? j'od
(on the online site)
vs kaba'i
(expected)
Reprex (3) שָׁמַיִם
shamáyim "sky"
-> S 'a _:(en)h'i:bru:_:S'Ind0t_(he) m'em 'a j'od 'i m'em
(on the online site)
vs Sam'ajim
(expected)
I'm not sure why this happens. It might be that this is not a rules issue but instead the diacritics are not parsed correctly in the code.
For better debug output, use -X. It shows you which rules are selected.
espeak-ng -v he 'פִּיל' -X
Translate '`'
Translate 'פִּיל'
36 פ [f]
57 פִ [fi]
Found: 'פ' [fe]
Translate 'ִ'
36 ִ [i]
Translate 'ּ'
Found: '_he' [h'i:bru:]
Found: '_ּ' [d'agES]
Found: 'י' [jod]
Found: 'ל' ['lamed]
Translate '`'
f'e 'i _:(en)h'i:bru:_:d'agES(he) j'od l'amed
The reason may be that פִּ is a composite of a base letter and zero-width diacritic marks, and the marks can be written in either order, U+05BC+U+05B4 or U+05B4+U+05BC. Both look the same, but the rule only covers U+05BC+U+05B4 (see attached picture).
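Since both orders render identically, it can help to list the code points of a string before testing it. The following is a minimal Python sketch for doing that; the function name `describe_code_points` is made up for illustration, and the script is not part of eSpeak-NG.

```python
import unicodedata


def describe_code_points(text: str) -> list[tuple[str, str, int]]:
    """Return (code point, Unicode name, canonical combining class) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"), unicodedata.combining(ch))
        for ch in text
    ]


if __name__ == "__main__":
    # Pe followed by hiriq and then dagesh, the order reported as failing above.
    sample = "\u05E4\u05B4\u05BC"
    for code_point, name, combining_class in describe_code_points(sample):
        print(code_point, name, combining_class)
```

Pasting a word from the reprex into `sample` shows which of the two orders the input actually uses.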
For Hebrew, as for Arabic, every non-canonical character order should be replaced with the canonical order (character + diacritic mark), e.g.
.replace
פִּ פִּ
The two forms look the same, but the rule replaces U+05B4+U+05BC with U+05BC+U+05B4.
Yes, that works. After the replace:
espeak-ng -v he פִּיל -X Replace: פִּ > פִּ Translate 'פּיל' 57 פּ [p] 78 פּי [pi] 36 פ [f]
36 ל [l]
p'il
Any ideas on how to automate this procedure? I only have experience with batch processing ASCII. Doing it by hand will definitely lead to mistakes.
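One possible way to automate the reordering outside of eSpeak-NG is to preprocess the text. This is a rough Python sketch and rests on an assumption: that he_rules expects dagesh (U+05BC), shin dot (U+05C1) and sin dot (U+05C2) directly after the base letter, before any vowel point (U+05B0 to U+05BB). That holds for the dagesh case discussed above; the shin/sin dot case is a guess based on reprex (3). The name `reorder_hebrew_marks` is hypothetical. Note that `unicodedata.normalize` would not help here, because Unicode canonical ordering puts hiriq (combining class 14) before dagesh (class 21), which is the order that fails.

```python
import re

# Hebrew vowel points, sheva (U+05B0) through qubuts (U+05BB).
VOWEL_POINTS = "\u05B0-\u05BB"
# Marks assumed to belong right after the base letter: dagesh, shin dot, sin dot.
CONSONANT_MARKS = "\u05BC\u05C1\u05C2"

# One or more vowel points followed by one or more consonant marks.
_VOWELS_BEFORE_MARKS = re.compile(f"([{VOWEL_POINTS}]+)([{CONSONANT_MARKS}]+)")


def reorder_hebrew_marks(text: str) -> str:
    """Move dagesh / shin dot / sin dot in front of any vowel points that precede them."""
    return _VOWELS_BEFORE_MARKS.sub(r"\2\1", text)


if __name__ == "__main__":
    word = "\u05E4\u05B4\u05BC\u05D9\u05DC"  # pe, hiriq, dagesh, yod, lamed
    fixed = reorder_hebrew_marks(word)
    print(" ".join(f"U+{ord(ch):04X}" for ch in fixed))
```

The reordered string can then be passed to `espeak-ng -v he`. Text that is already in the expected order is left unchanged, so the function is safe to run more than once.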
Hebrew is missing a lot of backend code, so the Hebrew rules fall back to English. We should look at what the code for the other Semitic languages (am, ar, mt) does.
- espeak-ng-data/lang/sem/he is almost empty.
- src/libespeak-ng/tr_languages.c has no case L('h', 'e'):
Hey, Hebrew just spells out the letters for me, with or without the symbols. Is that related? This is with the Android TTS engine.