Support languages like Chinese, Japanese, Thai, etc.

Open saginadir opened this issue 6 years ago • 19 comments

It's a cool library, but I'm worried that it won't slugify everything.

Chinese characters are just deleted.

slugify('你好'); // results in an empty string

saginadir avatar May 08 '18 07:05 saginadir

I'm curious, what would be the preferable result in this case?

Sigmus avatar May 08 '18 10:05 Sigmus

It is possible to convert Chinese to pinyin for example: https://stackoverflow.com/questions/4813086/how-to-convert-chinese-characters-to-pinyin

saginadir avatar May 09 '18 11:05 saginadir

Reading that answer, it appears that there is no single way to slugify Chinese characters. Even when converting them to Pinyin, it would be very hard to provide the correct conversion, as the last answer in the question you linked to points out.

If you have the translations handy, you can add them to your project and then slugify the translation. That would probably be easier than asking slugify to also convert from one language to another. I believe that's outside the scope of what the library was designed to do.

caraya avatar Jun 11 '18 01:06 caraya

Could we just leave CJK characters unchanged? Like Hello你好 -> hello-你好

XieJiSS avatar Feb 13 '19 05:02 XieJiSS
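
A minimal sketch of the "leave CJK characters unchanged" idea, assuming a standalone helper rather than anything in this library. The name `slugifyKeepCjk` and the choice of scripts to preserve are assumptions made for illustration; it relies on Unicode property escapes, which need an ES2018-capable runtime.

```javascript
// Hypothetical helper, not part of slugify's API.
// Runs of ASCII letters/digits and runs of preserved-script characters
// become tokens; everything else acts as a separator.
// U+30FC (the Japanese prolonged sound mark) has Script=Common, so it is listed explicitly.
const TOKEN = /[a-z0-9]+|[\p{Script=Han}\p{Script=Hiragana}\p{Script=Katakana}\p{Script=Hangul}\p{Script=Thai}\u30FC]+/gu;

function slugifyKeepCjk(input) {
  const tokens = input
    .normalize('NFKC') // fold full-width Latin letters and digits to ASCII
    .toLowerCase()
    .match(TOKEN);
  return tokens ? tokens.join('-') : '';
}

console.log(slugifyKeepCjk('Hello你好')); // hello-你好
```

Note that this drops every other non-ASCII character (accented Latin letters included), so it only illustrates the pass-through behaviour, not a complete slugifier.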

Wikipedia URLs contain unicode characters in their paths, so I figured that was OK and I was looking for a lib to do the same for my non-English site.

waynebloss avatar Mar 27 '19 18:03 waynebloss

> Could we just leave CJK characters unchanged? Like Hello你好 -> hello-你好

PR welcome for an opt-in options for it.

sindresorhus avatar Mar 27 '19 18:03 sindresorhus

mark

lizhengnacl avatar Jan 08 '20 15:01 lizhengnacl

@sindresorhus Yeah, "Ignores Chinese" is a bad title.

The Japanese get no love either. 残念。。。

brandonpittman avatar Feb 18 '20 03:02 brandonpittman

@brandonpittman I definitely intend to support languages like Chinese, Japanese, Thai, etc., but it's more work and will take some time. Help is always welcome though.

sindresorhus avatar Feb 18 '20 06:02 sindresorhus

If anyone wants to work on this, see the feedback given in https://github.com/sindresorhus/slugify/pull/30.

sindresorhus avatar May 07 '20 06:05 sindresorhus

We are currently using https://www.npmjs.com/package/transliteration but I'd love to use this library instead. Even basic/minimal support for Chinese/Japanese characters would be good enough for what we need.

alfaproject avatar Sep 15 '21 06:09 alfaproject

A little tip about the idea of converting Chinese to Pinyin, like 你好 to Nihao:

Conversion to Pinyin can never be 100% accurate, but in most cases the results are totally fine to use as slugs.

However, if the generated slugs are expected to be unique, then Pinyin is not a good idea, because it's highly likely that completely different Chinese characters get converted to the same Pinyin. For example, several different characters that are all read as "ni" would be converted to Ni, resulting in the same slug.

xiao99xiao avatar Oct 22 '21 14:10 xiao99xiao

As the original author of this issue, this popped up in my email. I read and write basic Chinese.

I have 2 thoughts about this:

  1. Who said a slug has to be unique? Most of the time, slugs are an additional way to represent the text to an ASCII-only system, and they don't necessarily have to be reversible back to the original UTF-8 text.

  2. I can float some ideas for making unique slugs if they are needed. For example, 你 could be changed into ni3, which is Pinyin + tone. 你好 can be ni3hao3, still a perfectly valid slug. Another way to make it unique is to use the stroke count, for example: 你好 would be ni7hao6. Still not unique enough? How about a mix of the two: ni37hao36. Even with this format I can't guarantee uniqueness, because the same input can come from 2 different sources, but it'll be better than just a pure nihao slug.

saginadir avatar Oct 24 '21 06:10 saginadir

> 1. Who said a slug has to be unique?

I didn't. I meant that in most cases Pinyin is totally fine for slugs, unless uniqueness is required, which depends entirely on the actual use case. I mentioned it because I noticed that the current slugify process produces unique slugs for the supported languages, though that might just be an unintended side effect.

> Another way to make it unique is to use stroke number

It's a good idea for reducing the chance of collisions.

xiao99xiao avatar Oct 24 '21 07:10 xiao99xiao

What I can say is that I needed slugs for URLs. For example, someone writes a post titled “我的冬季” or something like that. So instead of having a URL with an ID like this: mywebsite.com/post/421321812131 you can make it nicer, and better for SEO, like this: mywebsite.com/post/wo-de-dong-ji. Uniqueness can be solved by appending the ID: mywebsite.com/post/wo-de-dong-ji-421321812131

I guess everyone will have a different use case.

I've already started looking into developing a unique solution with strokes and tones. But this will be just for fun and will be a heavy library which most likely won't be front-end friendly.

saginadir avatar Oct 24 '21 07:10 saginadir
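
The ID-suffix approach to uniqueness mentioned above might look like this. The function names and the `/post/` route are assumptions taken from the example URL; the slug itself is assumed to have been produced elsewhere.

```javascript
// Hypothetical helpers: the slug is purely cosmetic, the trailing ID is the real key.
function postPath(slug, id) {
  return `/post/${slug}-${id}`;
}

// Recovers the numeric ID from the end of the path, ignoring the slug.
function idFromPath(path) {
  const match = /-(\d+)$/.exec(path);
  return match ? match[1] : null;
}

const path = postPath('wo-de-dong-ji', '421321812131');
console.log(path);             // /post/wo-de-dong-ji-421321812131
console.log(idFromPath(path)); // 421321812131
```

Because lookups use only the ID, two posts whose titles romanize to the same slug still get distinct URLs.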

Can we add other languages like https://en.wikipedia.org/wiki/Tifinagh (for Berber languages) to this issue, or is it only related to Asian languages?

The solution to allow for some untouched unicode ranges (provided in pull request https://github.com/sindresorhus/slugify/pull/30 that was closed) would be enough for my needs, but I understand it can be a bit difficult to use.

Here, the range would be 2D30—2D7F: https://unicode-table.com/en/blocks/tifinagh/

nhoizey avatar Jan 06 '22 23:01 nhoizey
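
For the "untouched Unicode ranges" idea, here is a rough sketch that takes explicit code point ranges such as the Tifinagh block (U+2D30 to U+2D7F). It does not reproduce the API proposed in the closed pull request; `slugifyWithRanges` and its `ranges` parameter are hypothetical.

```javascript
// Hypothetical helper: keeps ASCII letters/digits and any code point that
// falls inside one of the given [low, high] ranges; everything else becomes '-'.
function slugifyWithRanges(input, ranges) {
  const inKeptRange = (cp) => ranges.some(([low, high]) => cp >= low && cp <= high);
  let out = '';
  for (const ch of input.toLowerCase()) {
    if (/[a-z0-9]/.test(ch) || inKeptRange(ch.codePointAt(0))) {
      out += ch;
    } else if (out !== '' && !out.endsWith('-')) {
      out += '-'; // collapse runs of separators, never lead with one
    }
  }
  return out.replace(/-+$/, ''); // trim a trailing separator
}

const TIFINAGH = [0x2d30, 0x2d7f];
console.log(slugifyWithRanges('Hello ⵜⵉⴼⵉⵏⴰⵖ!', [TIFINAGH])); // hello-ⵜⵉⴼⵉⵏⴰⵖ
```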

Hey it's the year of 2024 and I think a bit of extra tech can be used.

I made a GPT for slugify-ing any Chinese text for my blog: https://chat.openai.com/g/g-1jvs433lo-slugifyzhuan-jia

Example: image

I've posted the prompt as a gist here so everyone can reproduce and edit it.

Hope this helps in some way.

RiddMa avatar Jan 03 '24 03:01 RiddMa

It's an interesting idea indeed :-)

saginadir avatar Jan 04 '24 09:01 saginadir

Here, the range would be 2D30—2D7F: https://unicode-table.com/en/blocks/tifinagh/

The unicode-table.com site has moved to https://symbl.cc/

deltoro05 avatar Apr 02 '24 14:04 deltoro05