Use the Unicode CLDR Transliteration Guidelines to generate slugs
A simple lookup table is currently used to transliterate Unicode codepoints into Latin characters. While this is acceptably accurate for some languages (e.g. Chinese characters to pinyin), it fails badly for languages with more complex transliteration rules.
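To illustrate the limitation, here is a minimal sketch of per-codepoint lookup in Python. The `TABLE` contents and the `transliterate` name are hypothetical and are not taken from this library's actual table. Because each codepoint is mapped on its own, a context-sensitive rule (many Greek romanization schemes render γ before another γ as "n", for instance) cannot be expressed.

```python
# Hypothetical, tiny per-codepoint table; a real table would be far larger.
TABLE = {
    "α": "a", "γ": "g", "ε": "e", "λ": "l",
    "ο": "o", "σ": "s", "ς": "s",
}

def transliterate(text: str) -> str:
    """Map each codepoint independently; unknown codepoints pass through."""
    return "".join(TABLE.get(ch, ch) for ch in text)

# Each γ is mapped in isolation, so the digraph γγ becomes "gg".
print(transliterate("αγγελος"))  # aggelos
```

A rule-based transform can look at neighbouring characters before choosing an output, which is what a flat table cannot do.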
Thankfully, the Unicode CLDR Transliteration Guidelines have done the heavy lifting.
Most platforms tap into this data through the ICU4C library. Personally I think the API is excellent, but calling C code from Erlang/Elixir has always been iffy.
I’m keen to hear what others have to say.
Now that I have finished up Unicode regexes and Unicode Sets I can start on Unicode Transforms. It's a non-trivial exercise, so I expect it will take a couple of months. I will update here as I make progress.
Wonderful, would really appreciate that! I have my eye on using Unicode.String.split for separators since you’ve done that, but figured I’ll do this update in one shot.
I've worked out most of the design which, as usual with CLDR, is largely about parsing and transforming source data into an executable format. As I look through the available transforms, there isn't a standard one that is "take every script and convert it to a URL string".
There are several which convert to Latin, and then there is a Latin-to-ASCII transform.
I think there are a few choices to be made and your input will be very helpful:
- Build a pipeline that basically runs all the known transforms to Latin, then Latin to ASCII. But I worry about performance.
- Narrow the transform list, a bit as you do today, and you decide which ones you want to apply.
- Configure the transforms to be used in some kind of backend module (a la Gettext and Cldr) or via Config.exs like the "old days".
One thing I know for sure is that it must be possible to define one's own transform pipeline. So one choice is to simply decide what you want to use as a transform pipeline, and then that will automatically be compiled to Elixir code.
Any thoughts or preferences? The transforms available with CLDR are here
If you check my icu branch, transliterating Any-Latin first then Latin-ASCII works. I’m not sure how Any is implemented though.
The Any-Latin; Latin-ASCII; Lower compound transform exactly describes what this library does.
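As a rough sketch of how a compound transform ID like this can be turned into a pipeline, the Python below splits the ID on `;` and chains the named steps. Everything here is an assumption for illustration: `compile_transform`, `TRANSFORMS` and `latin_ascii` are hypothetical names, `latin_ascii` is only an approximation (NFKD decomposition followed by dropping combining marks, not the CLDR Latin-ASCII rule set), and Any-Latin is deliberately left out because it needs the CLDR rule data.

```python
import unicodedata

def latin_ascii(text: str) -> str:
    # Approximation only: decompose, then drop combining marks.
    # Letters without a decomposition (e.g. "ø", "ß") pass through unchanged.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Hypothetical registry of the steps this sketch knows about.
TRANSFORMS = {
    "Latin-ASCII": latin_ascii,
    "Lower": str.lower,
}

def compile_transform(compound_id: str):
    """Turn 'A; B; C' into a function that applies A, then B, then C."""
    steps = []
    for name in (part.strip() for part in compound_id.split(";")):
        if not name:
            continue
        if name not in TRANSFORMS:
            raise KeyError(f"unknown transform: {name}")
        steps.append(TRANSFORMS[name])

    def run(text: str) -> str:
        for step in steps:
            text = step(text)
        return text

    return run

slugify = compile_transform("Latin-ASCII; Lower")
print(slugify("Crème Brûlée"))  # creme brulee
```

Resolving the steps once up front and returning a closure mirrors the idea of compiling a chosen pipeline ahead of time rather than interpreting the ID on every call.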
Arrgggh, apologies for not checking your ICU branch. Clear now, thanks.
Turns out CLDR doesn’t supply an Any-Latin transform, so it must be a transform internal to ICU4C, which I’ll go looking for in the source.