
Add support for CJK characters

Open pickfire opened this issue 3 years ago • 17 comments

Fix #12

pickfire avatar Jan 05 '21 16:01 pickfire

You can just disable the clippy lint about matches!; using that macro would raise the MSRV quite a bit.
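
For context, a minimal sketch of the trade-off being discussed; the lint is presumably clippy::match_like_matches_macro, and the helper names below are made up for illustration:

```rust
// Pre-matches! form: compiles on older toolchains, but clippy's
// match_like_matches_macro lint suggests rewriting it, so the lint
// would have to be allowed to keep this style.
#[allow(clippy::match_like_matches_macro)]
fn is_ascii_lower(c: char) -> bool {
    match c {
        'a'..='z' => true,
        _ => false,
    }
}

// matches! form: what the lint suggests, but the macro was only
// stabilized in Rust 1.42, so adopting it raises the crate's MSRV.
fn is_ascii_lower_v2(c: char) -> bool {
    matches!(c, 'a'..='z')
}
```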

jplatte avatar Jan 05 '21 17:01 jplatte

@jplatte I just changed it to use matches!.

pickfire avatar Jan 05 '21 17:01 pickfire

Independently of that, please consider boats' concerns from #13. What would be your answers to them? I'm already confused about the one special case in lowercasing¹ present in the code, and this adds a bunch more special cases. I totally understand the desire for this to work with CJK chars though, which AFAICT from #12 and #13 is not at all the case right now.

¹ edit: looked it up, not complicated but still of course very uncommon knowledge

jplatte avatar Jan 05 '21 17:01 jplatte

Regarding correctness, I just saw that, but I think having the patch is far more correct than not having it. Without it, Chinese text is entirely broken, with - inserted between every character; that's basically joining each character with -, so I don't think that's a good idea. Imagine if this crate converted hello world to h-e-l-l-o-w-o-r-l-d in English; that would not seem correct either.
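
For illustration, a hypothetical snippet of the behavior described above, assuming heck's kebab-case conversion (ToKebabCase in recent versions); the CJK output shown in the comment is the reported pre-fix behavior, not a guaranteed result:

```rust
use heck::ToKebabCase;

fn main() {
    // ASCII input is split on word boundaries, as expected.
    assert_eq!("Hello World".to_kebab_case(), "hello-world");

    // Per the report above, each Han character was treated as its own
    // "word" at the time, so a dash landed between every character:
    // "你好世界" came out as "你-好-世-界" instead of staying contiguous.
    println!("{}", "你好世界".to_kebab_case());
}
```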

Regarding maintainability, I wanted to see how other crates do this. I saw that there are cjk and unicode-blocks (I only just found the latter), and those crates also use different ranges; some even include drawing characters and some are missing ranges.

The nearest approach, I would say, is to use Unicode blocks but with our own ranges to determine whether a language is a continuous script, since that crate includes IDEOGRAPHIC_DESCRIPTION_CHARACTERS but does not cover Thai/Japanese, which do not have the concept of a word boundary IIRC.

I would say that CJK itself is complicated; there are thousands of Han characters (and I can only recognize roughly 2-8 thousand of them), and it is even more complicated because they are allocated across different blocks in Unicode.

pickfire avatar Jan 05 '21 17:01 pickfire

I think I'd be okay merging this if I could verify easily enough that these unicode ranges are in fact for blocks where you would otherwise get splits between every pair of codepoints.

Also, can you recommend a font that has glyphs for all of these?

jplatte avatar Jan 10 '21 13:01 jplatte

I think I'd be okay merging this if I could verify easily enough that these unicode ranges are in fact for blocks where you would otherwise get splits between every pair of codepoints.

I would say these cover most of them; in fact some of the blocks contain ancient characters which are not even in the font, but I added those just in case. I am not aware of any sites or articles or code saying that these are the correct Unicode ranges; it could be non-exhaustive, like Rust's #[non_exhaustive], since newer blocks could be added in the future.

Also, can you recommend a font that has glyphs for all of these?

No, I am not aware of any font that covers all Unicode glyphs. I am using Noto Sans CJK SC, which covers everything I need. I used to use WQY Microhei because I find it more beautiful, but some characters are not covered, and for example the Korean characters are rendered in a weird way. But still, if one uses Noto Sans, they should use the one for their most used locale, like SC/TC/JP/KR/HK since there are minor differences between them (mainly the style of strokes); it may not be optimal to use SC when reading Japanese, but it is a compromise, since some text may not specifically mention that it is written in Japanese and so will still be displayed with the Simplified Chinese font.

pickfire avatar Jan 11 '21 02:01 pickfire

I think I'd be okay merging this if I could verify easily enough that these unicode ranges are in fact for blocks where you would otherwise get splits between every pair of codepoints.

I would say these cover most of them; in fact some of the blocks contain ancient characters which are not even in the font, but I added those just in case. I am not aware of any sites or articles or code saying that these are the correct Unicode ranges; it could be non-exhaustive, like Rust's #[non_exhaustive], since newer blocks could be added in the future.

I don't care so much about the list being exhaustive (that was not what I was talking about), but there has to be some way to verify the ranges added need this special treatment to avoid dashes between every code point.

if one uses Noto Sans, they should use the one for their most used locale, like SC/TC/JP/KR/HK since there are minor differences between them (mainly the style of strokes)

Thanks.

jplatte avatar Jan 11 '21 13:01 jplatte

I don't care so much about the list being exhaustive (that was not what I was talking about), but there has to be some way to verify the ranges added need this special treatment to avoid dashes between every code point.

I am not aware of any method. We could do it the opposite way: check if the code point is part of a continuous script language, such as English, Spanish, or Russian, but that would be a lot more ranges, I guess.

pickfire avatar Jan 11 '21 14:01 pickfire

Well, how did you come up with these ranges?

We could do it the opposite way: check if the code point is part of a continuous script language, such as English, Spanish, or Russian, but that would be a lot more ranges, I guess.

That would presumably include ranges for things like mathematical symbols, emojis and arrows, right? I guess if you were to apply a case transformation to a bunch of emojis in a row, you wouldn't necessarily expect word splits in between but I'm not quite sure.

jplatte avatar Jan 11 '21 14:01 jplatte

Well, how did you come up with these ranges?

I checked out other implementations and looked at the different ranges on Wikipedia.

That would presumably include ranges for things like mathematical symbols, emojis and arrows, right? I guess if you were to apply a case transformation to a bunch of emojis in a row, you wouldn't necessarily expect word splits in between but I'm not quite sure.

Yes, I am also not sure how we would handle the rest, like symbols and stuff.

pickfire avatar Jan 11 '21 14:01 pickfire

I checked out other implementations and looked at the different ranges on Wikipedia.

If there are one or more Wikipedia pages where one can easily correlate these scripts with their Unicode ranges and verify they are in fact continuous scripts, could you include links in the comments? Also, could you add a brief explanation of the term "continuous script"?

jplatte avatar Jan 11 '21 14:01 jplatte

If there are one or more Wikipedia pages where one can easily correlate these scripts with their Unicode ranges and verify they are in fact continuous scripts, could you include links in the comments? Also, could you add a brief explanation of the term "continuous script"?

https://en.wikipedia.org/wiki/Scriptio_continua Oh, I used the term wrongly here. Thanks for asking; it is supposed to refer to languages that do not use spaces, and most of them (though not all) do not have the concept of casing. Let me update.

pickfire avatar Jan 11 '21 15:01 pickfire

I have seen that Wikipedia page, but everything there is only listed as an example, so the list could change. Also, it doesn't mention Unicode ranges. I would expect the code to have links to pages that cross-reference these ranges for verification; it shouldn't be necessary to hunt for this information to verify correctness.

May I ask what your use case for this is, by the way?

jplatte avatar Jan 11 '21 15:01 jplatte

I have seen that Wikipedia page, but everything there is only listed as an example, so the list could change. Also, it doesn't mention Unicode ranges. I would expect the code to have links to pages that cross-reference these ranges for verification; it shouldn't be necessary to hunt for this information to verify correctness.

https://en.wikipedia.org/wiki/CJK_Unified_Ideographs (look at the bottom part); the ranges there only cover part of it. We also need to cover Bopomofo (Chinese), Katakana (Japanese), Hiragana (Japanese), and Thai. Korean should arguably be included too, but from what I read they separate words nowadays; I don't know Korean, but a quick search says they usually separate words with spaces.

https://en.wikipedia.org/wiki/Korean_language Modern Korean is written with spaces between words, a feature not found in Chinese or Japanese (except when Japanese is written exclusively in hiragana, as in children's books). Korean punctuation marks are almost identical to Western ones. Traditionally, Korean was written in columns, from top to bottom, right to left, but it is now usually written in rows, from left to right, top to bottom.
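
For concreteness, a small sketch of the kind of hardcoded range check being discussed; the boundaries below are the published Unicode block ranges for the scripts named above, deliberately non-exhaustive and not necessarily the PR's exact list:

```rust
/// Illustrative, non-exhaustive Unicode block ranges for scripts written
/// without spaces between words (boundaries taken from the Unicode charts).
const CONTINUOUS_BLOCKS: &[(char, char)] = &[
    ('\u{0E00}', '\u{0E7F}'), // Thai
    ('\u{3040}', '\u{309F}'), // Hiragana
    ('\u{30A0}', '\u{30FF}'), // Katakana
    ('\u{3100}', '\u{312F}'), // Bopomofo
    ('\u{4E00}', '\u{9FFF}'), // CJK Unified Ideographs
];

/// Hypothetical helper: true if `c` falls in one of the listed blocks.
fn in_continuous_block(c: char) -> bool {
    CONTINUOUS_BLOCKS
        .iter()
        .any(|&(start, end)| (start..=end).contains(&c))
}

fn main() {
    assert!(in_continuous_block('你'));
    assert!(in_continuous_block('ก')); // Thai letter ko kai
    assert!(!in_continuous_block('a'));
}
```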

May I ask what your use case for this is, by the way?

I don't need it myself, but the person who opened the issue needs it; they even opened a pull request, though it uses a different method than this one. I just came across the issue and thought it would be good to handle, since I use CJK quite a bit. Their issue has been open since 2018 and wasn't fixed.

pickfire avatar Jan 11 '21 15:01 pickfire

You could look at the Unicode Script property of characters to determine if they need different treatment, instead of hardcoding ranges. Although that might mean another dependency (for example: https://github.com/unicode-rs/unicode-script).

Additionally, apparently you can look at the "LB letters" field of the CLDR's script metadata (JSON version here) to find the identifiers for scripts that are continuous. There's a more human-readable spreadsheet here or here. In the latter there is a note that says:

Basically scripts like Thai and Chinese that don't use spaces between words.

That gives you a more easily automated and less fragile basis.

Then, instead of just doing no processing when these scripts are present, you could do sentence segmentation, either on the whole string, or just on the relevant parts (groups of contiguous characters with those scripts) and use word segmentation on the others. Examples:

  • Source: 你好。我叫John Smith,很高兴认识你。
  • Sentence segmentation on whole string: 你好-我叫johnsmith很高兴认识你
  • Mixed sentence + word segmentation: 你好-我叫-john-smith-很高兴认识你

Any leftover punctuation from sentence segmentation would need an extra step to be removed, which you could do by running it through word segmentation afterwards then joining back together.
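
As a rough sketch of this suggestion, assuming the unicode-script and unicode-segmentation crates: instead of true sentence segmentation, it simply splits the input into runs by script, keeps continuous-script runs intact, and applies word segmentation plus lowercasing to everything else. Names like mixed_kebab and is_continuous are made up for illustration.

```rust
use unicode_script::{Script, UnicodeScript};
use unicode_segmentation::UnicodeSegmentation;

// Illustrative, incomplete list of scripts written without spaces between
// words; the CLDR "LB letters" metadata mentioned above is the real source.
fn is_continuous(c: char) -> bool {
    matches!(
        c.script(),
        Script::Han | Script::Hiragana | Script::Katakana | Script::Thai
    )
}

// Push a finished run into `pieces`: continuous-script runs are kept whole,
// everything else goes through UAX #29 word segmentation (which also drops
// punctuation and whitespace) and gets lowercased.
fn flush(run: &mut String, continuous: bool, pieces: &mut Vec<String>) {
    if run.is_empty() {
        return;
    }
    if continuous {
        pieces.push(std::mem::take(run));
    } else {
        pieces.extend(run.unicode_words().map(str::to_lowercase));
        run.clear();
    }
}

// Very rough "mixed" segmentation: group the input into runs of
// continuous-script vs. other characters, then join all pieces with '-'.
fn mixed_kebab(input: &str) -> String {
    let mut pieces = Vec::new();
    let mut run = String::new();
    let mut continuous = false;

    for c in input.chars() {
        let cont = is_continuous(c);
        if cont != continuous {
            flush(&mut run, continuous, &mut pieces);
            continuous = cont;
        }
        run.push(c);
    }
    flush(&mut run, continuous, &mut pieces);

    pieces.join("-")
}

fn main() {
    // Reproduces the "mixed sentence + word segmentation" example above.
    assert_eq!(
        mixed_kebab("你好。我叫John Smith,很高兴认识你。"),
        "你好-我叫-john-smith-很高兴认识你"
    );
}
```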

noinkling avatar Feb 28 '21 14:02 noinkling

Thanks a lot, I didn't know about the "LB letters". But doesn't that require detecting which language each character belongs to? Wouldn't that be costly?

Do you know of any other similar libraries (even in other languages) that implement this?

I think this needs @jplatte's opinion; I wonder what he thinks.

pickfire avatar Feb 28 '21 17:02 pickfire

I think "Mixed sentence + word segmentation" from the examples above would be ideal. I don't think perf should be a concern here, unless somebody can show me or at least think of a scenario where you'd want to run case conversion on a large body of text.

The only use cases I actually know of are

  • proc-macros (serde(rename_all) kind of things)
  • generating "slugs" for titles

both of which never operate on input large enough for case conversion to make a noticeable perf impact. As for the extra dependency we might need for this, it seems warranted if the input can contain non-ASCII characters, and for the proc-macro use case I've been wanting to make the existing unicode-segmentation dependency optional anyway (there are just too many other things on my plate at the moment so I haven't gotten to it; see also #26).

jplatte avatar Mar 02 '21 15:03 jplatte

#45 added better support for CJK characters in a simpler way. Let me know if you see any issues with it (will release a new rc version with it soon-ish).

jplatte avatar Aug 15 '23 08:08 jplatte