fsrx
Unicode-based split of words and graphemes
What:
- Inputs in non-Latin alphabets are now processed correctly.
Why:
- The original regex `[\w\\']+` only matches Latin alphabets, so non-Latin inputs were not processed at all.
- `&str::len()` is not always equal to the number of Unicode graphemes, so indices in `style_substr` were calculated incorrectly for multibyte characters.
How:
- Changed the word-split algorithm from regex to `unicode_word_indices`.
- Changed the character indexing from `&str` slices to `UnicodeSegmentation::graphemes`.
Tests:
- English: `echo 'The Quick Brown Fox Jumps Over The Lazy Dog' | fsrx`
- French: `echo "Le cœur déçu mais l'âme plutôt naïve, Louÿs rêva de crapaüter en canoë au-delà des îles, près du mälström où brûlent les novæ" | fsrx`
- Korean: `echo '키스의 고유 조건은 입술끼리 만나야 하고 특별한 기술은 필요치 않다.' | fsrx`
Checklist:
- [x] `Allow edits from maintainers` option checked
- [x] Branch name is prefixed with `[your_username]/` (ex. `coloradocolby/featureX`)
- [x] Documentation added
- [x] Tests added
- [x] No failing actions
- [x] Merge ready
Caveat:
- I confirmed that languages that put spaces between words are processed much like English. However, `fsrx`'s algorithm (and Bionic Reading) does not seem applicable to some Asian languages that do not use spaces between words (such as Japanese and Chinese).
- Some terminal emulators (e.g., Alacritty, if I remember correctly) may not properly support Unicode input/output. I tested my code with `xfce4-terminal`.
- For non-Latin alphabets, I tested my code with the `D2Coding` font, but other fonts such as Noto should work as well.
damn @ichianr this looks amazing. I don't have time rn to look it over but will tonight!