Support ICU transliteration in ECMAScript - rough draft

Open sven-oly opened this issue 3 years ago • 0 comments

ICU4C's transliterator class provides transformation capabilities for text. Given a set of transliteration rules, a transliterator can transform text into other forms including different writing systems.

Transforms can be much more sophisticated than simple one-to-one character substitution. For example, conversion from Serbian in Cyrillic to Latin script can be supported by such a rule set with the transliteration engine. CLDR defines many transforms for use in C++ and Java implementations, as well as in the Python package PyICU.
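
To illustrate why one-to-one substitution is not enough, here is a minimal sketch of Serbian Cyrillic-to-Latin conversion. It is not ICU's or CLDR's rule set: the function name `serbianCyrillicToLatin` and the casing heuristic are assumptions made for this example. The point is that letters such as љ, њ and џ expand to two Latin letters, and the casing of the result depends on the neighbouring characters.

```typescript
// Hypothetical helper for illustration only; not part of ICU, CLDR or ECMA-402.
// Maps each lowercase Serbian Cyrillic letter to its Latin equivalent.
const SR_CYRL_TO_LATN: Record<string, string> = {
  "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ",
  "е": "e", "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k",
  "л": "l", "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o",
  "п": "p", "р": "r", "с": "s", "т": "t", "ћ": "ć", "у": "u",
  "ф": "f", "х": "h", "ц": "c", "ч": "č", "џ": "dž", "ш": "š"
};

// True when the character has a distinct lowercase form, i.e. it is uppercase.
function isUpper(ch: string): boolean {
  return ch !== "" && ch !== ch.toLowerCase();
}

function serbianCyrillicToLatin(text: string): string {
  let out = "";
  for (let i = 0; i < text.length; i++) {
    const ch = text.charAt(i);
    const lower = ch.toLowerCase();
    const latin: string | undefined = SR_CYRL_TO_LATN[lower];
    if (latin === undefined) {
      out += ch; // not a Serbian Cyrillic letter: copy unchanged
    } else if (ch === lower) {
      out += latin; // lowercase input
    } else if (latin.length === 1) {
      out += latin.toUpperCase(); // uppercase, single-letter result
    } else if (isUpper(text.charAt(i + 1)) || isUpper(text.charAt(i - 1))) {
      out += latin.toUpperCase(); // digraph inside an all-caps run: "LJ"
    } else {
      out += latin.charAt(0).toUpperCase() + latin.slice(1); // title case: "Lj"
    }
  }
  return out;
}

const titleCase = serbianCyrillicToLatin("Љубав"); // "Ljubav"
const allCaps = serbianCyrillicToLatin("ЉУБАВ");   // "LJUBAV"
```

A rule-based engine expresses this kind of context dependence declaratively in the rule set, so the application does not need to hand-code it per language.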

Implementing the full capabilities of ICU4C's transliterator in JavaScript is complicated. However, including it in ECMAScript's Intl object may be straightforward.

I propose that ECMAScript provide an API that makes the transformation capabilities available in web browsers. Such an implementation could be provided without storing all the data for existing CLDR transforms. Specific rule sets would be provided by the website or the server side for its particular needs. Such rule sets could be obtained from CLDR data or customized for the many text processing purposes that applications need.

A Node.js version with C++ bindings is also available from LongNow. However, this author is not aware of any implementation with the full power and flexibility of ICU's version.

Other examples include transformation of Myanmar text to Latin and to International Phonetic Alphabet (IPA) characters. Transliteration rules are also used to convert text from non-Unicode encodings into Unicode, as exemplified by the CLDR data for the Zawgyi encoding, which is widely used for the Burmese language on the web.

Another example is Google Translate's output for languages such as Burmese and Hindi, which includes a Latin-script version of the text (for example, of Myanmar text) as part of the translation interface.

An ECMAScript-based transliterator would be useful for multilingual websites and those that support multiple scripts for a language. Web apps could use it to better support languages written in multiple scripts, as well as to provide pronunciation guides for readers.

API functions supported would include:

  1. Instantiating a transliteration object, given a rule set
  2. Providing a transliteration of input text
  3. Providing an inverse transliteration based on the current rules
  4. getDisplayName, returning a name for the transliterator that is appropriate for a given locale
  5. handleGetSourceSet, returning the set of all characters that can be modified in the input text
  6. Applying filters as needed by the application
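
As a rough sketch of how the six functions above might fit together, the following TypeScript class implements a toy engine. Everything here is an assumption for illustration: the class name `SketchTransliterator`, the `Rule` shape, and the method signatures are hypothetical and are not ICU4C's API or a proposed ECMA-402 surface. Rules are plain source/target pairs tried longest-first, which is far simpler than ICU's rule syntax with contexts and quantifiers.

```typescript
// Hypothetical sketch: none of these names exist in ECMA-402 or ICU4C.
interface Rule {
  source: string; // text to match in the input
  target: string; // replacement text
}

class SketchTransliterator {
  private readonly id: string;
  private readonly rules: Rule[];
  private readonly names: Record<string, string>;
  private readonly filter: (ch: string) => boolean;

  // (1) Instantiate from a rule set; display names and filter are optional.
  constructor(
    id: string,
    rules: Rule[],
    names: Record<string, string> = {},
    filter: (ch: string) => boolean = () => true
  ) {
    this.id = id;
    // Drop empty sources and try longer sources first.
    this.rules = rules
      .filter((r) => r.source.length > 0)
      .sort((a, b) => b.source.length - a.source.length);
    this.names = names;
    this.filter = filter;
  }

  // (2) Transliterate input text; (6) positions rejected by the filter are copied as-is.
  transliterate(text: string): string {
    let out = "";
    let i = 0;
    while (i < text.length) {
      let matched: Rule | undefined;
      if (this.filter(text.charAt(i))) {
        for (const rule of this.rules) {
          if (text.slice(i, i + rule.source.length) === rule.source) {
            matched = rule;
            break;
          }
        }
      }
      if (matched) {
        out += matched.target;
        i += matched.source.length;
      } else {
        out += text.charAt(i);
        i += 1;
      }
    }
    return out;
  }

  // (3) Inverse built by swapping each rule; only round-trips if the rules are unambiguous.
  inverse(): SketchTransliterator {
    const swapped = this.rules.map((r) => ({ source: r.target, target: r.source }));
    return new SketchTransliterator(this.id + " (inverse)", swapped);
  }

  // (4) Localized name, falling back to the identifier.
  getDisplayName(locale: string): string {
    const has = Object.prototype.hasOwnProperty.call(this.names, locale);
    return has ? this.names[locale] : this.id;
  }

  // (5) Characters that appear in rule sources and pass the filter.
  getSourceSet(): string[] {
    const seen: Record<string, boolean> = {};
    for (const rule of this.rules) {
      for (let i = 0; i < rule.source.length; i++) {
        const ch = rule.source.charAt(i);
        if (this.filter(ch)) seen[ch] = true;
      }
    }
    return Object.keys(seen).sort();
  }
}

const toLatin = new SketchTransliterator(
  "Example-Cyrl-Latn",
  [
    { source: "љ", target: "lj" },
    { source: "л", target: "l" },
    { source: "ј", target: "j" },
    { source: "а", target: "a" }
  ],
  { en: "Cyrillic to Latin (example)" }
);
const latin = toLatin.transliterate("љла"); // "ljla"
```

Swapping rules to build the inverse is the simplest possible choice and is lossy when two sources share a target; a fuller design would presumably let the rule set declare each rule's direction explicitly.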

We could also add functions to use existing transliteration rule sets that ECMAScript may provide, using inputs such as language and script to match needed input / output criteria.
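
One way such a lookup could work is sketched below. The registry contents, the `findRuleSet` function, and the `example-…` rule-set identifiers are placeholders invented for this example, not real CLDR transform IDs. The sketch prefers a language-specific rule set and falls back to a script-only one.

```typescript
// Hypothetical sketch: the registry, its entries and the function name are
// assumptions for illustration, not part of ECMA-402 or CLDR.
interface TransformQuery {
  language?: string;  // e.g. "sr"; omitted means "any language"
  fromScript: string; // ISO 15924 script code, e.g. "Cyrl"
  toScript: string;   // ISO 15924 script code, e.g. "Latn"
}

interface RegistryEntry extends TransformQuery {
  ruleSetId: string;  // placeholder identifier of a built-in rule set
}

const REGISTRY: RegistryEntry[] = [
  { language: "sr", fromScript: "Cyrl", toScript: "Latn", ruleSetId: "example-sr-Cyrl-Latn" },
  { fromScript: "Cyrl", toScript: "Latn", ruleSetId: "example-Cyrl-Latn" },
  { language: "my", fromScript: "Mymr", toScript: "Latn", ruleSetId: "example-my-Mymr-Latn" }
];

// Returns the best matching rule-set id, or undefined when nothing matches.
function findRuleSet(query: TransformQuery): string | undefined {
  let fallback: string | undefined;
  for (const entry of REGISTRY) {
    if (entry.fromScript !== query.fromScript || entry.toScript !== query.toScript) {
      continue; // scripts must match exactly
    }
    if (entry.language !== undefined && entry.language === query.language) {
      return entry.ruleSetId; // language-specific match wins
    }
    if (entry.language === undefined && fallback === undefined) {
      fallback = entry.ruleSetId; // remember the script-only rule set
    }
  }
  return fallback;
}

const serbian = findRuleSet({ language: "sr", fromScript: "Cyrl", toScript: "Latn" });
const generic = findRuleSet({ language: "ru", fromScript: "Cyrl", toScript: "Latn" });
```

Here `serbian` resolves to the language-specific entry and `generic` to the script-only fallback, so a site would only need to ship its own rules when no built-in set fits.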

sven-oly, Apr 05 '21 21:04