linenoise
linenoise copied to clipboard
Another implementation for utf8 support.
Since I saw #186, I would like to post my implementation for utf8 support, too. It supports Emoji and CJK characters which have wide width as well. When the Unicode standard gets updated, the only thing we need is to update tables in `encodings/utf8.c'. We can also easily add another encodings by implementing by providing the following functions:
typedef size_t (linenoisePrevCharLen)(const char *buf, size_t buf_len, size_t pos, size_t *col_len);
typedef size_t (linenoiseNextCharLen)(const char *buf, size_t buf_len, size_t pos, size_t *col_len);
typedef size_t (linenoiseReadCode)(int fd, char *buf, size_t buf_len, int* c);
You can see more detail from my comments in #25.
I don't have any intention to promote my implementation at all with this pull request. But hope this implementation helps you when designing the utf8 support in linenoise in addition to @ManzoniGiuseppe's excellent implementation.
Thank you!
Nah, mine was just a temporary fix for basic utf-8 support until someone wrote something more complete (which apparently was already written long ago). Anyway, I tried yours out, and I found two things. It seems I can't create issues in yours, so, I'll put them here:
- The wideCharTable seems bugged to me. Things like °, ¡, €, ¤, ×, ¹, á, ¶ are shown as single column in my xterm, but your impl considers them as two columns large. Maybe it's my xterm to be bugged, but consider the following: Your impl considers the lower case á to be two columns wide, but its uppercase Á is a single column... same strange thing for the pairs ǎ/Ǎ, à/À, while ã/Ã, ä/Ä, ȧ/Ȧ are all single column.
- You put this codepoint in your example, and you said to support unicode 12.1, but the Zero Width Joiner (U+200D, added in unicode 1.1.5) is not supported width-wise. I mean, it's neither a zero width nor a joiner.
@ManzoniGiuseppe, thanks for the thorough checking!
The wideCharTable seems bugged to me. Things like °, ¡, €, ¤, ×, ¹, á, ¶ are shown as single column in my xterm, but your impl considers them as two columns large. Maybe it's my xterm to be bugged, but consider the following: Your impl considers the lower case á to be two columns wide, but its uppercase Á is a single column... same strange thing for the pairs ǎ/Ǎ, à/À, while ã/Ã, ä/Ä, ȧ/Ȧ are all single column.
You are right. My the wideCharTable generator (generate_wide_char_table.rb) treated 'Ambiguous' (A) entries as Wide chars. The new generator code now excludes them from wideCharTable. Please check with the latest code.
The reason why I considered Ambiguous entries as Wide chars is based on the following information from UAX #11 'East Asian Width' document for the Ambiguous characters:
In a broad sense, wide characters include W, F, and A (when in East Asian context), and narrow characters include N, Na, H, and A (when not in East Asian context).
But it seems like there are only a few cases where the ambiguous entries should be treated as wide chars. So I think this change is ok, and hope it fixes the problems that you reported.
You put this codepoint in your example, and you said to support unicode 12.1, but the Zero Width Joiner (U+200D, added in unicode 1.1.5) is not supported width-wise. I mean, it's neither a zero width nor a joiner.
In the EastAsianWIdth.txt in Unicode 13.0 Database has the following entry:
200B..200F;N # Cf [5] ZERO WIDTH SPACE..RIGHT-TO-LEFT MARK
U+200D is declared as 'Narrow' (N). So I am not sure if U+200D should be treated as a wide character at this point...
Anyway I didn't make any decision by myself as to whether characters should be wide or narrow, rather the Unicode East Asian Width database does all. The benefit of this approach is very easy to catch up to the latest Unicode standard. (Just download UnicodeData.txt and EastAsianWidth.txt from Unicode.org, and re-generate combiningCharTable and WideCharTable with the ruby scripts.)
But if there are cases that this auto-generated-table-driven way cannot handle, it seems like such cases should be handled manually...
I included tools directory where you can see the latest Unicode 13.0 UnicodeData.txt and EastAsianWidth.txt and ruby scripts to generate the C++ tables. Hope it helps you to understand my code better. :)
U+200D is declared as 'Narrow' (N). So I am not sure if U+200D should be treated as a wide character at this point...
N means neutral, not narrow. U+200D is a default ignorable codepoint, so in the context of a terminal it should be zero width.
@craigbarnes, thanks for finding my incorrect understanding for N. :)
It's the same difference really. It's the default ignorable part that matters. The most straight forward way to support those codepoints is to parse DerivedCoreProperties.txt and make all Default_Ignorable_Code_Point entries have a width of 0.
Hey folks, thanks for your efforts. What do you think the best support is so far among the ones contributed? I would like to spend some time to review.
@antirez, thank you for your interest in adding UTF-8 support.
I don't say that my implementation is the best though, you might find something helpful in it when supporting UTF-8 in linenoise.
Here is the diff between the latest master and my branch: https://github.com/yhirose/linenoise/commit/72bf4bb031c913b06e697ce47a70350649c0595a#diff-72af1757e65272a44fc4aa3b479e2f3b
Most changes in 'linenoise.c' are basically for calculation of correct column position for multi-bytes characters. (Size of a character could now be multi-bytes, and it also could be a 'wide' char which consumes 2 column widths like Emoji and CJK characters.)
The actual UTF-8 specific code is separated in `encodings/utf8.h and c'. It implements the following interfaces which are used in 'linenoise.c'.
typedef size_t (linenoisePrevCharLen)(const char *buf, size_t buf_len, size_t pos, size_t *col_len);
typedef size_t (linenoiseNextCharLen)(const char *buf, size_t buf_len, size_t pos, size_t *col_len);
typedef size_t (linenoiseReadCode)(int fd, char *buf, size_t buf_len, int* c);
So if you want to support another encoding, what we need to do is to implement the above interfaces for the encoding and integrate them to linenoise by calling linenoiseSetEncodingFunctions function in 'linenoise.h'.
Hope it helps when you review the code. Please let me know if you have any questions. Thank you!
Of the two I know of, this has an obviously better utf-8 support then mine. If anything I think a function to detect if utf-8 is supported would be nice and easy. For example mine internally (didn't want to change the API) checks it by printing a U+1EE4 and detecting of how many positions the cursor was moved. Many other encodings like all ISO 8859,Windows-1252, and Mac OS Roman would print it as 3 valid chars instead of 1. I think it'd be useful for many applications as they may not know which encoding their terminal is using.
Ok I took some time to read both the implementations. They are both nice in different ways. But my feeling is that this implementation is too complex for the linenoise design idea of minimality, and @ManzoniGiuseppe implementation, which is more in the right direction, is too conservative in the abstractions, so it ends making a lot of math with buffer positions. I want to attempt a more minimal implementation, and I'll post it as a PR as well. Let's see if there is a possible, really minimal approach. Thanks.
@antirez, sounds good to me!
In case anyone cares, I've updated this branch to work with the multiplexing API / resolved the current merge conflicts. https://github.com/yhirose/linenoise/pull/1
I have rebased the utf8-support branch on yhirose/linenoise with the latest master branch, and I confirmed that the utf8-support branch works with Japanese and Emoji.
Rebased the utf8-support branch on yhirose/linenoise with the latest master branch.