transliteration
Separate function and data
T13n is just a collection of character pairs, organized by manual or local t13n standards (character mappings/pairs). Some users only need a tool for converting strings by their own standard; others need the pair collections to use in their own software.
I think we need to separate the tool from the data (character pairs).
Could you explain more about this? I'm working on a change to the API to allow a custom charmap (it can be set in the configuration, so all following calls will use that config without providing options each time). Does that solve the problem? I'm also thinking of providing a pluggable data source, but it needs to be well considered.
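For reference, a rough sketch of how that configuration-based idea could look. The config() call, the `charmap` option name and its flat { from: to } format are assumptions here, not the released API:

// Rough sketch of the configuration idea above; the config() call, the
// `charmap` option and its flat { from: to } format are assumptions.
var transliterate = require('transliteration').transliterate;

// set once; all following calls would use this config without passing options each time
transliterate.config({ charmap: { 'ł': 'l', 'đ': 'dj' } });

console.log(transliterate('Łódź Đakovo'));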
dist/transliteration.js - the function/API that transliterates words or sentences.
dist/transliteration-data.js - the set of character pairs (or formulas) describing how words should be transliterated.
BTW, check wiki/Help:Multilingual_support_(Indic) to understand that other languages transliterate in more complex ways.
Function:
module.exports = function transliteration(data, req, res) {
  // ... calculations, depending on the passed `data`
  return res;
};
Data:
var Cyrillic = require('./data/Cyrillic.iso9.ru');
var Hindi = require('./data/Hindi.remington.hi'); // for example only (http://www.findurlaptop.com/tech/2012/08/20/hindi-typing-and-google-transliteration/)
module.exports = Object.assign({}, Cyrillic, Hindi);
Example 1:
I want to use my own set of pairs:
var data = require('hulu-char-pairs');
var tr = require('transliteration').bind(null, data);
var res = tr(req);
Example 2:
I want to use my own (coolest) transliteration tool, but the same character pair data:
var data = require('transliteration.data');
var tr = require('my-transliteration').bind(null, data);
var res = tr(req);
Thanks for the link. Yes, it can be done in this way; if gzipped it should be much smaller, and in the future we probably need to cache it in localStorage or somewhere if it's run in the browser.
I just tried it. After gzipping, the transliteration.min.js file weighs less than about 80 KB. However, do you have more info on how to get different rules to transliterate different languages? And where can we get the data from? I can get some data for transliterating Chinese and Japanese, which both have cases where one character can have different pronunciations depending on context or word combinations. How about other languages?
http://www.findurlaptop.com/tech/2012/08/20/hindi-typing-and-google-transliteration/
We need to change the data grouping behavior. From the beginning, I saw that the data was grouped by char code, so to cover all characters you need a lot of memory. But do you really need that? Does anyone need that?
I think not. In the Baltic states, for example, there are 5 common languages: LV, LT, EE, EN, RU. If I want to cover transliteration for all of these languages, I only need a few special characters from LV, LT and EE plus all the combinations from RU (because of its non-Latin alphabet).
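To illustrate, such a charmap could be very small. The character choices below are examples only, not an official standard, and they reuse the data layout sketched earlier in this issue:

// Illustrative only: a handful of special characters for LV/LT/EE plus the
// full Cyrillic pair set for RU, following the data module sketched above.
module.exports = Object.assign(
  {
    'ā': 'a', 'č': 'c', 'ē': 'e', 'ģ': 'g', 'ī': 'i', 'ķ': 'k', 'ļ': 'l', 'š': 's', 'ž': 'z', // LV
    'ė': 'e', 'į': 'i', 'ų': 'u', 'ū': 'u',                                                   // LT
    'õ': 'o', 'ä': 'a', 'ö': 'o', 'ü': 'u',                                                   // EE
  },
  require('./data/Cyrillic.iso9.ru') // all RU combinations, as in the data module above
);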
I think the memory usage is OK for now. It takes less than 2 MB RSS if you load all the data into memory. Please try process.memoryUsage() to see. And for Node, those files are conditionally loaded, which means if you do not use them they are not required in the code.
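For anyone who wants to verify that, a quick check along those lines (actual numbers depend on Node version and machine):

// Measure resident set size after loading the library and its data.
var transliterate = require('transliteration').transliterate;

transliterate('测试'); // force the charmap data to be loaded

var rss = process.memoryUsage().rss;
console.log('rss: ' + (rss / 1024 / 1024).toFixed(1) + ' MB');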
@andyhu Any news on this? It is a great library, but it bloats my bundle size a lot: ~300 KB minified, and it looks like the data takes almost all of that space. If I could bundle only the languages I need, the bundle size would decrease significantly.
There is another way... If you only need to remove accents in Latin-script languages, I'd recommend using String.prototype.normalize for now.
// shim for String.prototype.normalize https://github.com/walling/unorm
export default str => str.normalize('NFKD')
  .replace(/[\u0300-\u036f]/g, '') // strip combining accent marks
  .replace(/\u0142/g, 'l'); // ł (U+0142) is a letter in itself, so NFKD does not decompose it

/*
// usage:
import normalize from './normalize.js';
console.log(normalize('ąśćńżółźćęāēūīšģķļ'));
*/
With webpack, one could map the required alphabet for transliteration to charmap.json.
Example https://gist.github.com/ogonkov/bb415854f6a27e39471d391672e43003
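The gist boils down to aliasing the library's bundled data file to a trimmed local copy. A minimal sketch of that idea; the in-package path 'transliteration/data/charmap.json' is an assumption, so verify it against the layout of the version you actually have installed:

// webpack.config.js - minimal sketch of the aliasing idea from the gist above
const path = require('path');

module.exports = {
  // ...the rest of your webpack configuration
  resolve: {
    alias: {
      // Point the library's bundled charmap at a trimmed-down local copy that
      // contains only the characters/languages this app actually needs.
      'transliteration/data/charmap.json': path.resolve(__dirname, 'src/charmap.ru.json')
    }
  }
};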
@ogonkov Nice idea. If I have time I'll first try to replace browserify with webpack. And maybe it's better to make different builds for different purposes and developers' own preferences, so it can be more flexible.
Currently, if you want, you can replace the default character map database by using an undocumented API, transliterate.setCharmap(). But you cannot get rid of the default one. The character map data originally comes from the ICU project; I know its quality is pretty low, but I can't find any better data source. Ideally the end user should be able to choose which languages or Unicode blocks they would like the module to support and load the respective data. It requires a lot of work, especially high-quality data.
Actually, the most difficult thing is not the code but the data. I can't find enough data to support all the different sorts of transliteration rules for each language. Probably I'll first make it more flexible and let the community contribute the character mapping rules for each language.
Well, the solution above works pretty nicely for me; it leaves only the required data in the JSON and saves a lot in total.
For Russian transliteration this charmap works smoothly
I'm also thinking of compressing the JSON file and unpacking it in the browser, so it can save around 100+ KB of space for the default code map data.
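One way that could work, assuming the charmap is pre-gzipped at build time and fetched at runtime; pako is just one possible inflate library, not something this package uses today:

// Sketch of "ship compressed, unpack in the browser": fetch a pre-gzipped
// charmap and inflate it client-side before handing it to the library.
import { ungzip } from 'pako';

export async function loadCharmap(url = '/charmap.json.gz') {
  const buffer = await (await fetch(url)).arrayBuffer();
  return JSON.parse(ungzip(new Uint8Array(buffer), { to: 'string' }));
}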
I think it's better not to bundle the JSON by default and to have each charmap as a separate import, along the lines of the sketch below.
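Purely hypothetical; none of these module paths exist in the package today, and how the charmaps would get registered is also an assumption:

// Hypothetical per-language entry points; only imported charmaps would end up
// in the bundle, everything else could be tree-shaken away.
import { transliterate } from 'transliteration';
import ru from 'transliteration/charmap/ru';
import lv from 'transliteration/charmap/lv';

// The registration call and the `charmap` option are assumptions as well.
transliterate.config({ charmap: { ...ru, ...lv } });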
@ogonkov That's true. I'm thinking of separating code and data in the v3 version.
2.x breaks my solution above, because now the JSON data is bundled into the JS file that uses it.
I'll probably implement this in the next version, but only if I can manage to get enough time.
What is the plan? How should it work?
Like separating function and data as this issue suggests, and probably also adding an async method, since sync operations may block the main thread under intensive usage. I'm also planning to implement a new data storage algorithm. It should reduce the package size to about 50% to 70% of the current size and also improve performance a bit (hopefully). Anyway, considering my current workload, I might not be able to start working on it within 1-2 months, unless there's a sponsor.
One of the proposals in this comment? https://github.com/dzcpy/transliteration/issues/14#issuecomment-219258081
I have to investigate a bit more first, but it should be something similar.
Other transliteration packages, such as the Transliteration component of Drupal - whose data was originally based on the Unidecode CPAN package but heavily improved in the meantime - use separate mapping files for each language, and the language needs to be passed as an argument to the transliterate() function. Could you save a lot of time and effort by parsing and converting those existing data files from PHP into JSON (with a one-liner, sketched after this comment) as an ongoing data update process executed each time before releasing new package versions?
As a former co-maintainer of Drupal's Transliteration project - even though the base data of Unidecode was an excellent starting point already - I can tell that this is a fairly big undertaking and never-ending process of patches that are hard to verify and confirm without knowledge about the native languages, so you need a large and active community of contributors for all languages to get them right.
EDIT: PR #57 from 2017 seems to go in a similar direction.
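For what it's worth, a hedged sketch of that conversion step from Node. The `$base` array and the data/ file layout are assumptions about the Drupal/Unidecode format, so adjust to the actual structure:

// Shell out to PHP, include one of Drupal's transliteration data files and
// dump the mapping array as JSON. Paths and the $base variable are assumptions.
const { execSync } = require('child_process');
const fs = require('fs');

const json = execSync(
  `php -r '$base = array(); include "data/x04.php"; echo json_encode($base);'`
).toString();

fs.writeFileSync('data/x04.json', json);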
Yes, actually I used to be a Drupal developer myself (Drupal 6). It's an excellent module and gave me a lot of inspiration. The original data was indeed converted from PHP, but back then Drupal's transliteration module didn't have separate files. Maybe I'll take a look again. There are many errors in the data, but I've been fixing them whenever anyone finds one. Maybe we can share the data between the two projects. I'm thinking of refactoring the code from scratch; I just haven't had enough time. As you said, a large and active community is necessary. Do you have any suggestions on building such a community?