fsrx
Unicode-based split of words and graphemes
What:
- Inputs in non-Latin alphabets are now processed correctly.
Why:
- The original regex `[\w\\']+` only matches Latin alphabets, so non-Latin inputs were not processed at all.
- `&str::len()` is not always equal to the number of Unicode graphemes, so indices in `style_substr` were calculated incorrectly for multibyte characters.
How:
- Changed the word-split algorithm from regex to `unicode_word_indices`.
- Changed the character indexing from `&str` slices to `UnicodeSegmentation::graphemes`.
Tests:
- English: `echo 'The Quick Brown Fox Jumps Over The Lazy Dog' | fsrx`
- French: `echo "Le cœur déçu mais l'âme plutôt naïve, Louÿs rêva de crapaüter en canoë au-delà des îles, près du mälström où brûlent les novæ" | fsrx`
- Korean: `echo '키스의 고유 조건은 입술끼리 만나야 하고 특별한 기술은 필요치 않다.' | fsrx`
Checklist:
- [x] `Allow edits from maintainers` option checked
- [x] Branch name is prefixed with `[your_username]/` (ex. `coloradocolby/featureX`)
- [x] Documentation added
- [x] Tests added
- [x] No failing actions
- [x] Merge ready
Caveat:
- I confirmed that languages that put spaces between words are processed much like English. However, `fsrx`'s algorithm (and Bionic Reading) does not seem applicable to some Asian languages that do not use spaces between words (such as Japanese and Chinese).
- Some terminal emulators (e.g., Alacritty, if I remember correctly) may not properly support Unicode input/output. I tested my code with `xfce4-terminal`.
- For non-Latin alphabets, I tested my code with the `D2Coding` font, but other fonts such as Noto should work as well.
damn @ichianr this looks amazing. I don't have time rn to look it over but will tonight!