
Suggested changes, fixes and updates to Hebrew transliteration

Open eyaler opened this issue 2 years ago • 7 comments

I would like to ask for @alonbl's feedback/green light before preparing my PR. I am interested in addressing several issues I see in the current Hebrew transliteration:

  1. 05ef (triple yod) can now be transliterated as YYY.
  2. It seems inconsistent to me to transliterate rafe as - and dagesh as '. If we are going by the graphics, then dagesh should be . (dot). But I think a more useful choice would be to ignore both of them (as is currently done for the shin dots).
  3. Better alignment with Hebrew Language Academy rules (https://hebrew-academy.org.il/wp-content/uploads/taatik-ivrit-latinit-1-1.pdf):
     a. 05d7 ח is never transliterated as KH. A more standard-compliant version would be H (to differ from h) or h.
     b. It is inconsistent to transliterate א as A and ע as a backtick. ע could be A or 'A or A', but mind you, all these choices, including the one for א, are non-standard. Also, the backtick for ע is from the "exact standard", but we are otherwise following the "simple standard" here, which uses '. I am really not sure what the right thing to do is. We could also follow other languages and use the letter name in these cases: ALEPH and AYIN.
     c. Using @ for schwa is consistent with the IPA symbol, but it is not useful and not part of the Hebrew standard, which ignores schwa in transliteration (or in some cases uses e).
     d. ק should be k as in the simple standard (q is used in the exact standard).
  4. I am not sure what 05f5, 05f6, 05f7 are, as they are not part of Unicode afaict.
  5. Fixes in Hebrew presentation forms (https://www.unicode.org/charts/PDF/UFB00.pdf):
     a. fb4f should be EL not l
     b. fb4e should be f not p
     c. fb4d should be KH not k
     d. fb4c should be v not b
     e. fb4b should be o not vo; similarly, fb1d should be i not yi
     f. fix e.g. sh, ts to be SH, TS, as done for the regular letters
     g. fb47 should be k not ts (this is a mistake)
     h. fb41 should be s not n (this is a mistake)
     i. fb3e should be m not l (this is a mistake)
     j. fb30, currently missing, should be i
     k. fb27 should be r not m (this is a mistake)
     l. add fb21, fb20, similar to the choices decided on for regular א, ע
  6. Graphically, sof-pasuk looks like :, but for NLP tasks it would be more useful to use "." or even ". ", as this is the meaning of the punctuation.
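
For anyone who wants to double-check items 4 and 5 before the PR, here is a minimal sketch that looks the code points up in the Unicode database. It assumes only Python's standard `unicodedata` module (its data version depends on the Python release, so results for recently added characters may vary). The names `is_assigned`, `describe` and `SUSPECTED_MISTAKES` are illustrative and not part of unidecode.

```python
import unicodedata


def is_assigned(code_point: int) -> bool:
    """Return True if the code point is assigned in this Python's Unicode data.

    Unassigned code points are reported with the general category "Cn".
    """
    return unicodedata.category(chr(code_point)) != "Cn"


def describe(code_point: int) -> str:
    """Format a code point as 'U+XXXX NAME', or '<unassigned>' if it has no name."""
    name = unicodedata.name(chr(code_point), "<unassigned>")
    return f"U+{code_point:04X} {name}"


# Item 4: code points that appear in the table but seem to be unassigned.
QUESTIONABLE = (0x05F5, 0x05F6, 0x05F7)

# Item 5 (g, h, i, k): presentation forms flagged as mistakes above, mapped to
# the transliteration proposed in this issue (not the current unidecode output).
SUSPECTED_MISTAKES = {0xFB47: "k", 0xFB41: "s", 0xFB3E: "m", 0xFB27: "r"}

for cp in QUESTIONABLE:
    print(describe(cp), "assigned" if is_assigned(cp) else "not assigned")

for cp, proposed in SUSPECTED_MISTAKES.items():
    # The character name shows which base letter the presentation form encodes.
    print(describe(cp), "-> proposed:", proposed)
```

The printed character names make it easy to compare each presentation form's base letter against the proposed transliteration.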

eyaler, Jul 26 '21 01:07