`text/slugify` gives empty results for non-Latin alphabets
Describe the bug
slugify gives readable results for only a subset of languages, namely those that use the Latin alphabet (English, Spanish, Vietnamese...). For all other languages (Chinese, Arabic, Russian...), it just returns an empty string.
Steps to Reproduce
slugify('三人行,必有我师焉') // returns ""
Expected behavior
slugify to return non-Latin-alphabet words unchanged (but still stripping start/end punctuation and replacing medial punctuation with dashes).
Something like the following:
function intlFriendlySlugify(str: string) {
  return str
    .toLowerCase()
    .normalize() // NFC normalization, so visually identical inputs yield identical slugs
    .replaceAll(/[^\p{L}\p{M}\p{N}]+/gu, '-') // collapse runs of non-letter/mark/number chars into dashes
    .replaceAll(/^-|-$/g, '') // trim leading/trailing dashes
}
IMO the slugify function should not handle percent encoding, instead leaving that up to the calling code (e.g. passing to the URL constructor or URLSearchParams):

import { join } from '@std/path/posix' // (assumed import for join)

const slug = intlFriendlySlugify('三人行,必有我师焉') // "三人行-必有我师焉"
const origin = 'https://example.com'

const url1 = new URL(origin)
url1.pathname = join('blog', slug)

const url2 = new URL(origin)
url2.searchParams.set('q', slug)

const url3 = new URL(origin)
url3.hostname = `${slug}.example.com`
url1: https://example.com/blog/%E4%B8%89%E4%BA%BA%E8%A1%8C-%E5%BF%85%E6%9C%89%E6%88%91%E5%B8%88%E7%84%89
url2: https://example.com/?q=%E4%B8%89%E4%BA%BA%E8%A1%8C-%E5%BF%85%E6%9C%89%E6%88%91%E5%B8%88%E7%84%89
url3: https://xn----io6a1vz30ct6b22enxlhzue11c.example.com/
While it's true they're unreadable in the percent-encoded/punycode form, modern browsers display the human-readable form automatically on hover, or in the address bar when you open them. At some point it might also be worth adding a prettifyUrl function to allow similar conversion in userland, but I'm leaving that out of scope for this issue.
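For illustration only, a minimal sketch of what such a prettifyUrl function could look like (the name comes from the paragraph above; nothing here is an existing API, and it only reverses percent-encoding — decoding a punycode hostname back to Unicode would need extra logic):

// Hypothetical prettifyUrl sketch: decodes percent-encoded path/query/hash
// segments back to human-readable text; punycode hostnames are left as-is
function prettifyUrl(url: string | URL): string {
  const u = new URL(url)
  const path = decodeURIComponent(u.pathname)
  const query = u.search ? `?${decodeURIComponent(u.search.slice(1))}` : ''
  const hash = u.hash ? `#${decodeURIComponent(u.hash.slice(1))}` : ''
  return `${u.protocol}//${u.host}${path}${query}${hash}`
}

prettifyUrl('https://example.com/blog/%E4%B8%89%E4%BA%BA%E8%A1%8C-%E5%BF%85%E6%9C%89%E6%88%91%E5%B8%88%E7%84%89')
// "https://example.com/blog/三人行-必有我师焉"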
Environment
OS: Ubuntu 20.04, WSL
deno version: 1.46.0
std version: [email protected]
Here's a rundown of how various platforms handle non-ASCII text in slugs:
| Site | Diacritics | Non-Latin | Notes |
|---|---|---|---|
| stackoverflow.com | cartão de credito → cartão-de-credito (Unchanged) | Python で特定の文字 → python-で特定の文字 (Unchanged) | |
| wikipedia.org | Maria Angélica Beraldo → Maria_Angélica_Beraldo (Unchanged) | 佩通坦·钦那瓦 → 佩通坦·钦那瓦 (Unchanged) | Wikipedia uses _ instead of - for spaces, but it's still a slug of sorts |
| tumblr.com | wüst → wüst (Unchanged) | Word of the Day: 久违 (Chinese) → word-of-the-day-久违-chinese (Unchanged) | |
| wordpress.org | Actualización de mantenimiento → actualizacion-de-mantenimiento (Stripped) | WordPress 6.6「Dorsey」发布 → wordpress-6-6dorsey发布 (Unchanged) | |
| github.com | Introducción → introducción (Unchanged) | 7. 兩岸詞典 /c/ → 7-兩岸詞典-c (Unchanged) | Slugified section titles in README files in the URL hash |
| medium.com | Cómo salirse → como-salirse (Stripped) | 跟我学中文! → 跟我学中文 (Unchanged) | |
| dev.to | 2 años como Front-End Developer → 2-anos-como-front-end-developer (Stripped) | データ・ストリーミング技術の概要 → detasutoriminguji-shu-nogai-yao (Transliterated) | This illustrates some of the problems with stripping diacritics and transliterating: 2 años means "2 years" whereas 2-anos means "2 anuses"; meanwhile 技術 should be gijutsu in Japanese, not ji-shu |
The transliteration option has one big advantage, namely that the URL remains legible in any context: plaintext files, IM platforms with limited rich-text features, etc. It also typically leads to shorter URLs compared to the percent-encoded version. Still, it's strictly worse when viewed in a browser address bar, adds a massive amount of complexity, including mappings for thousands of CJK characters, and often still leads to suboptimal results (as seen in the dev.to examples). That's probably why only dev.to uses it out of the 7 platforms I looked at.
As for diacritics, 3 of the 7 platforms strip them from Latin-script text, while the other 4 keep them. As with transliteration, stripping leads to more plaintext-friendly URLs; however, diacritics can be semantically important, also illustrated by the dev.to example.
In the initial implementation it was decided to port npm:slugify, so slugify('三人行,必有我师焉') === "" is actually expected.
The behavior you describe is probably better handled with @std/text/to-kebab-case
import { toKebabCase } from "@std/text/to-kebab-case";
console.log(toKebabCase("三人行-必有我师焉")); // "三人行-必有我师焉"
@timreichen In the state it was merged, that PR isn't a port of npm:slugify, because it doesn't include any char mapping. But more to the point, why is it more important to have parity with a specific NPM package rather than to have slugify be a general-purpose function for creating slugs, as implemented by sites like WordPress, Stack Overflow, Wikipedia, GitHub, Medium, and Tumblr? npm:slugify has multiple open issues about its lack of support for various languages.
And if toKebabCase already works reliably as a general-purpose slugify function (which I'm not sure about, haven't tested it), why expose a separate slugify that only works for a subset of use cases?
> @timreichen In the state it was merged, that PR isn't a port of npm:slugify, because it doesn't include any char mapping. But more to the point, why is it more important to have parity with a specific NPM package rather than to have slugify be a general-purpose function for creating slugs, as implemented by sites like WordPress, Stack Overflow, Wikipedia, GitHub, Medium, and Tumblr? npm:slugify has multiple open issues about its lack of support for various languages.
We removed the char mapping because the list was random. The problem is, as you pointed out, that there is no standard, and slugify functionality varies depending on the implementation.
> And if toKebabCase already works reliably as a general-purpose slugify function (which I'm not sure about, haven't tested it), why expose a separate slugify that only works for a subset of use cases?
Every slugify function will only work on a subset of use cases. That is why, for example, npm:slugify has so many options and lets you add custom replacements, etc.
I think if there is a clean way to support other languages, that should be added.
However, the slug must match [a-zA-Z0-9-]*.
> However, the slug must match [a-zA-Z0-9-]*.
Why? Again, massive platforms like WordPress, Stack Overflow, Wikipedia, GitHub, Medium, and Tumblr don't obey that rule, and browsers and web APIs handle non-ASCII URL components perfectly fine. Allowing "" as a slug is far riskier sanitization-wise than allowing arbitrary non-ASCII text, because the path /a//b normalizes to /a/b (and additionally, /a/ is often normalized to /a). Meanwhile, non-ASCII text can never clash with reserved characters, which always fall within the printable ASCII range.
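For instance (the /blog/:slug/comments route shape here is just for illustration):

const slug = '' // what slugify currently returns for '三人行,必有我师焉'
console.log(`/blog/${slug}/comments`) // "/blog//comments"
// many servers and routers normalize "/blog//comments" to "/blog/comments",
// so the request can end up addressing a different resource than intended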
Maybe we could just change the signature of slugify so users can provide their own strip regex?
function slugify(input: string, strip = /[^a-zA-Z0-9\s-]/g): string
This way it doesn't really add much more complexity, while offering a bit more liberty to end users (who would know best which charset they'd like to support).
slugify("déjà-vu", /[^a-zA-Z0-9\s-À-ÖØ-öø-ÿ]/g) // "déjà-vu"
@lowlighter That seems to me like it's simultaneously too granular and not customizable enough. Too granular because I can't see any good reason why you'd want to allow some non-ASCII but not others; not customizable enough because it still doesn't provide any way of mapping.
Something like this could work:
// slugify.ts
export type SlugifyOptions = {
/** @default {undefined} */
charMap: Record<string, string> | undefined,
/** @default {Boolean(options.charMap)} */
stripUnknown: boolean,
/** @default {Boolean(options.charMap || options.stripUnknown)} */
stripDiacritics: boolean,
}
export function slugify(input: string, options?: Partial<SlugifyOptions>): string
// slugify_char_map.ts
// A comprehensive char mapping (transliteration) from some decently authoritative source
export const charMap = {
// ...
я: "ya",
// ...
鼎: "ding",
// ...
}
If you really want to opt in to the "nuke everything other than Basic Latin" option for some reason, you could still do that with slugify(..., { stripUnknown: true }) or slugify(..., { charMap: {} }).
As for "decently authoritative source" for the char map, I'm not sure what that would be. https://unicode-org.github.io/icu/userguide/transforms/general/ provides some notes on transliteration, which suggest that a simple charMap isn't really sufficient, but it looks like implementing proper transliteration is pretty complicated, so a char map could end up being the least-worst option (other than the actual least-worst option, which is just relying on percent-encoding to do its thing 😜)
I'm happy to defer to the consensus of others with this issue, but if we go down the route of having a character map, best it be a Map<string, string>.
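For illustration, that might look like the following (same hypothetical entries as the sketch above, just expressed as a Map):

// slugify_char_map.ts (hypothetical)
export const charMap: ReadonlyMap<string, string> = new Map([
  // ...
  ["я", "ya"],
  // ...
  ["鼎", "ding"],
  // ...
])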
Looks like the requisite ICU data for transliteration is here: https://github.com/unicode-org/icu/blob/main/icu4c/source/data/translit/. With some truly disgusting regex-based """parsing""" of the data files, Intl.Segmenter-based word segmentation, and some gibberish strings of words in various languages, I can get decent-ish results (this uses ~500kb of very un-optimized mapping data for the char map):
[vi]
Bằng khác byte phần bảng ký sun hợp của tự.
bang-khac-byte-phan-bang-ky-sun-hop-cua-tu
[zh]
解决始终这些统一,大部分,既成事实节字体编码的。
jiejue-shizhong-zhexie-tongyi-dabufen-chengshishi-jie-ziti-bianma-de
[de]
Ging loslassen Steuerzeichen in auf übersetzt, Phaistos Pau und Ugaritisch.
ging-loslassen-steuerzeichen-in-auf-ubersetzt-phaistos-pau-und-ugaritisch
[es]
Modificaciones equivalencia que esquemas alquímicos ha bits vez de una.
modificaciones-equivalencia-que-esquemas-alquimicos-ha-bits-vez-de-una
[ru]
Бит данных частично характерный текст, на в с клавиатуры из.
bit-dannyx-chastichno-xarakternyi-tekst-na-v-s-klaviatury-iz
[ar]
رمز والأسلوب العالم، متناسقة يتكون الراغبون، التفريق كترميزات في النصوص.
rmz-walaslwb-alalm-mtnasqh-ytkwn-alraghbwn-altfryq-ktrmyzat-fy-alnsws
[ja]
韓国利用運用文字のをれドキュメントマッピングと。
hanguo-liyong-yunyong-wenzi-no-o-re-tokyumentomahin-ku-to
[el]
Υπολογιστή ωστόσο αφήνει όλες βασίζονται προβλήματα στο κωδικοποίησης την για.
ypologiste-ostoso-apheni-oles-basizondai-problemata-sto-kodikopoieses-ten-gia
Limitations: Japanese will always give bad results for Kanji; Arabic lacks most vowels (I think that's due to the vowels not being indicated in the first place, so no way round that); the Greek is currently based on Ancient Greek transliteration, but I think that can be fixed by ignoring certain input files.
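For reference, the Intl.Segmenter step mentioned above looks roughly like this (the transliteration via the ICU-derived char map is omitted here, and exact word splits vary by ICU version):

// Segment CJK input into words so the transliterated output can be dash-separated
const segmenter = new Intl.Segmenter("zh", { granularity: "word" })
const words = [...segmenter.segment("三人行,必有我师焉")]
  .filter((s) => s.isWordLike) // drop punctuation-only segments like ","
  .map((s) => s.segment)
console.log(words) // e.g. ["三人行", "必有", "我", "师", "焉"]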
OK, switched to using a custom peggy grammar to parse the ICU mappings, and now getting what seem to be decent results for all the languages I've tested (as good as or better than results from other general-purpose transliteration libraries I've looked at).
Comparison:
| Language | Example | @std/slugify@next (with charMap option) | npm:slugify | npm:transliterate |
|---|---|---|---|---|
| Amharic | በርበሬን ከላመ ከሞተ አግኝተሽው ዋጥ ስልቅጥ አድርገሽ ከምኔው ጨረሽው | beribereni-kelame-kemote-ginyiteshiwi-wati-silikiti-dirigeshi-keminewi-chereshiwi | | barebareene-kalaama-kamota-agenyetashewe-waathe-seleqethe-aderegashe-kameneewe-charashewe |
| Arabic | الحركة الدولية للدفاع عن الأطفال الفلسطينين ضد بايدن | alhrkh-aldwlyh-lldfa-n-alatfal-alflstynyn-zd-baydn | alhrkh-aldwlyh-lldfaa-an-alatfal-alflstynyn-dhd-baydn | lhrk-ldwly-lldfaa-aan-l-tfl-lflstynyn-dd-bydn |
| German | Leichtathletik-Weltmeisterschaften 2007/Teilnehmer (Liechtenstein) | leichtathletik-weltmeisterschaften-2007-teilnehmer-liechtenstein | Leichtathletik-Weltmeisterschaften-2007Teilnehmer-(Liechtenstein) | leichtathletik-weltmeisterschaften-2007-teilnehmer-liechtenstein |
| Greek | Βραβείο Καλύτερου Διευθυντή Φωτογραφίας της Ένωσης Διαδικτυακών Κριτικών Κινηματογράφου | vravio-kaliterou-dhievthindi-fotografias-tis-enosis-dhiadhiktiakon-kritikon-kinimatografou | Brabeio-Kalyteroy-Diey8ynth-Fwtografias-ths-Enwshs-Diadiktyakwn-Kritikwn-Kinhmatografoy | vraveio-kalyteroy-dieythynti-fotografias-tis-enosis-diadiktyakon-kritikon-kinimatografoy |
| Spanish | Temporada 2018 del Campeonato Brasileño de Motovelocidade | temporada-2018-del-campeonato-brasileno-de-motovelocidade | Temporada-2018-del-Campeonato-Brasileno-de-Motovelocidade | temporada-2018-del-campeonato-brasileno-de-motovelocidade |
| Hindi | संयुक्त अरब अमीरात क्रिकेट टीम का स्कॉटलैंड दौरा 2016 | samyaukata-araba-amairaata-karaikaeta-taima-kaa-sakaotalaaimda-daauraa-2016 | 2016 | snyukt-arb-amiiraat-krikett-ttiim-kaa-skonttlaindd-dauraa-2016 |
| Icelandic | Alfreð Clausen syngur lög eftir Jenna Jónsson | alfred-clausen-syngur-log-eftir-jenna-jonsson | Alfred-Clausen-syngur-log-eftir-Jenna-Jonsson | alfred-clausen-syngur-log-eftir-jenna-jonsson |
| Japanese | コンティニュイング・ケア・リタイアメント・コミュニティ | konte-ni-ingu-kea-ritaiamento-komyunite | | konteiniyuingukearitaiamentokomiyunitei |
| Russian | 500 величайших альбомов всех времён по версии журнала Rolling Stone | 500-velichayshix-albomov-vsex-vremyon-po-versii-jurnala-rolling-stone | 500-velichajshih-albomov-vseh-vremyon-po-versii-zhurnala-Rolling-Stone | 500-velichayshih-albomov-vseh-vremyon-po-versii-zhurnala-rolling-stone |
| Thai | จังหวัดมุกดาหารในการเลือกตั้งสมาชิกสภาผู้แทนราษฎรไทยเป็นการทั่วไป พ.ศ. 2562 | canghwad-mukdahar-in-kar-eluxk-tang-smachik-spha-phu-aethn-radr-ithy-epnkar-thawip-ph-s-2562 | ..-2562 | cchanghwadmukdaahaarainkaareluue-ktangsmaachiksphaaphuuaethnraasdraithyepnkaarthawaip-ph.s.-2562 |
| Vietnamese | Cục Phát thanh, truyền hình và thông tin điện tử (Việt Nam) | cuc-phat-thanh-truyen-hinh-va-thong-tin-dien-tu-viet-nam | Cuc-Phat-thanh-truyen-hinh-va-thong-tin-djien-tu-(Viet-Nam) | cuc-phat-thanh-truyen-hinh-va-thong-tin-dien-tu-viet-nam |
| Chinese | 2020年夏季奧林匹克運動會輕艇女子500公尺單人愛斯基摩艇比賽 | 2020-nian-xiaji-aolinpike-yundonghui-qing-ting-nuzi-500-gongchi-danren-aisijimo-ting-bisai | 2020500 | 2020nian-xia-ji-ao-lin-pi-ke-yun-dong-hui-qing-ting-nu-zi-500gong-chi-dan-ren-ai-si-ji-mo-ting-bi-sai |
npm:slugify gives empty results for Amharic, Hindi, Thai, Chinese, and Japanese, and also has some extremely questionable choices for Greek (θ → 8, ω → w, η → h). npm:transliterate gives more concise (possibly better?) results for Hindi but gives even less vowel-ey results for Arabic, lacks spacing for Japanese/Thai (I added ZWSPs to stop the table breaking the layout), and has suboptimal spacing for Chinese.
Char map is ~213KB (un-minified, un-gzipped).
IMO those are "good enough" results at this stage (given that the default will be not to transliterate), but it'd be good to get some input from speakers of a few more of these languages. You can also try it out with other languages here: https://dash.deno.com/playground/slugify
Upon testing more languages (list taken from npm:any-ascii's examples), we're still missing Braille (which I think is safe to omit, as web content written in Braille must be vanishingly rare? Someone please correct me if I'm wrong) and at least 3 South-East Asian languages (Burmese/Myanmar, Khmer, Lao). Also, the Korean example looks a bit sus compared to the other versions. And npm:any-ascii now seems to be best-in-class for JS transliteration, at least among the ones I've found.
While I personally found this research interesting, I think it's difficult to do these transliterations in an unopinionated way. It also seems difficult to maintain them, as the maintainers are not knowledgeable about many of these languages. I'd consider the handling of non-Latin-alphabet languages out of scope for this API.
> I'd consider the handling of non-Latin-alphabet languages out of scope for this API.
@kt3k My main concern isn't that non-Latin script should have special handling, rather that it should be passed through rather than removed (and especially that it shouldn't be removed as a default option). It's worth mentioning that in its current state, slugify doesn't even handle fully-Latin-alphabet text properly — for example, various alphabetic chars like [ßĐæø] are removed (Blöße becomes bloe, Trần Hưng Đạo becomes tran-hung-ao, Nærøy becomes nry).
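A quick repro of those cases (the import path is my assumption; adjust to wherever the current slugify is exported from):

import { slugify } from "@std/text/unstable-slugify"; // path assumed

console.log(slugify("Blöße")); // "bloe"
console.log(slugify("Trần Hưng Đạo")); // "tran-hung-ao"
console.log(slugify("Nærøy")); // "nry"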
I only started looking into transliteration, which IMO is a less-good option compared to pass-through (not to mention significantly less common in-the-wild), as an alternative.
With all that said... I'm inclined to think you're probably right. Furthermore, @lowlighter's suggestion of a strip regex is a useful option after all, but with the suggested regexes exported from the package itself (roll-your-own is probably less useful).
With that option, you can easily implement pass-through (default), strip, strip-diacritics, or even strip-only-ascii-diacritics behavior. The regex would be run against the NFD form so it could easily deal with diacritics:
export const NON_WORD = /[^\p{L}\p{M}\p{N}\-]+/gu; // strip only non-word chars; all scripts pass through
export const DIACRITICS = /[^\p{L}\p{N}\-]+/gu; // additionally strip combining marks (diacritics) in NFD form
export const ASCII_DIACRITICS = /(?<=[a-zA-Z])\p{M}+|[^\p{L}\p{M}\p{N}\-]+/gu; // strip marks only when they follow ASCII letters
export const NON_ASCII = /[^0-9a-zA-Z\-]/g; // strip everything except ASCII alphanumerics and dashes
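For concreteness, here's a rough sketch of how such a slugify could apply the strip option against the NFD form (the pipeline details are my assumption, not necessarily what would be merged; it reuses the NON_WORD constant above):

function slugify(input: string, { strip = NON_WORD }: { strip?: RegExp } = {}): string {
  return input
    .trim()
    .toLowerCase()
    .normalize("NFD") // decompose, so diacritics become standalone Mark code points
    .replaceAll(/\s+/g, "-") // turn whitespace runs into dashes before stripping
    .replaceAll(strip, "") // drop whatever the strip regex matches
    .replaceAll(/-+/g, "-") // collapse any resulting dash runs
    .normalize("NFC");
}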
import { assertEquals } from "@std/assert";

// NON_WORD (the default)
assertEquals(slugify("déjà-vu"), "déjà-vu");
assertEquals(slugify("Συστημάτων Γραφής"), "συστημάτων-γραφής");
assertEquals(slugify("déjà-vu", { strip: DIACRITICS }), "deja-vu");
assertEquals(slugify("Συστημάτων Γραφής", { strip: DIACRITICS }), "συστηματων-γραφης");
assertEquals(slugify("déjà-vu", { strip: ASCII_DIACRITICS }), "deja-vu");
assertEquals(slugify("Συστημάτων Γραφής", { strip: ASCII_DIACRITICS }), "συστημάτων-γραφής");
assertEquals(slugify("déjà-vu", { strip: NON_ASCII }), "deja-vu");
assertEquals(slugify("Συστημάτων Γραφής", { strip: NON_ASCII }), "-");
Further, you could easily use a third-party transliteration library along with strip: NON_ASCII:
import transliterate from 'npm:any-ascii'
assertEquals(slugify(transliterate("Συστημάτων Γραφής"), { strip: NON_ASCII }), "systimaton-grafis");
Ah, OK. The strip option sounds good to me. Looks like a balanced solution between added complexity and practicality.