dex
dex copied to clipboard
Strange editor behaviour with UTF-8 text
I'm not sure exactly what the problem is, but certain UTF-8 characters cause the editor to behave strangely. A simple way to reproduce the problem is to insert สี
at the beginning of a line, then add a few ASCII characters to the right, move into the middle of those character and press backspace a few times. This deletes the character 1 to the left of the expected character. An extra space also appears at the end of the line that was never inserted.
For each extra สี
added to the beginning of the line, the backsapce offset moves 1 to the left and 1 more mysterious space is added at the end of the line.
I guess this is an error calculating display width somewhere. The example character I used above is a 3 byte character (ส
; UCS4: U+0E2A
; UTF-8: 0xE0 0xB8 0xAA
), followed by a 3 byte combining character (UCS4: U+0E35
; UTF-8: 0xE0 0xB8 0xB5
), that together should occupy 1 column.
Just for testing purposes, I tried adding 0x0e35
to be recognized as a combining character in u_is_combining()
. That seems to fix the problems I descibed above but reveals a few other quirks when navigating or deleting the characters.
There's no combining character support yet (only a hack in u_char_width() which makes the situation tolerable). Fixing this could be too much work. Also the tables in unicode.c should be generated, not hardcoded.
Yes, it seems to only happen with Thai characters, so same as #34 . Some editors do this correctly, notably JED. jupp, vim, nano, pico, le and joe. Others have the same problem as dex, like e3 and diakonos,
I hope you will be able to fix this, as dex has a lot of potential, being not as byzantine as JED, jupp and joe (and vim) but way more capable than nano and pico.
Yes, it seems to only happen with Thai characters, so same as #34
I don't think it's only Thai characters -- it seems to be combining characters in general. In the UTF-8 demo file you mentioned in issue #34, line 57 also has the same problem.
Right! So it's any combining characters.
@tihirvon I think the DerivedCombiningClass.txt file from the official Unicode 8 character database has all the information needed. It's only ~150KB, whereas the full UCD is ~1.6MB. It seems fairly easy to parse with AWK or a small C program.
It seems that any character in the Grapheme_Extend
category should be considered "zero-width" for the purposes of cursor positioning. GLib has a g_unichar_iszerowidth()
function that returns true
for any character in this category.
I'm willing to work on this issue, but I have a few questions before I start...
- Is it ok to commit
DerivedCombiningClass.txt
to the repo? - Is it ok to use (portable) AWK for parsing the database and generating the lookup table?
- How much more work do you think there is to do, besides fixing/replacing
u_char_width()
?
Fantastic initiative for trying to tackle this!! Within the github ecosystem it should be easy to make a new branch and try things out, right?
Indeed it seems that (currently) only characters marked Mn are zero-width, in the current list only 470 characters. But if the GLib function g_unichar_iszerowidth() doesn't have too much overhead, it makes sense to use it.
For Backspace-deleting these, the unicode character gets removed, but the cursor stays rendered in the same place. For (position) Deleting, the zerowidth character plus the preceding non-zerowidth 'carrier character' get removed.
There's also the issue of selections. It shouldn't be allowed to split a Grapheme_Base
character from it's adjacent Grapheme_Extend
characters, e.g. via a select and cut operation. Then there's the issue of:
not all terminals support zero-width rendering of zero-width marks
... as mentioned in the GLib docs.
I guess for terminals that don't support zero-width rendering of zero-width marks this whole unicode rendering business is irrelevant, and that means for dex this isn't an issue that we can deal with.
Indeed, selections need to always work on the 'rendered (combined) character' level. That's also what I was trying to refer to with "(position) Deleting", the selection of a combined character includes both the 'carrier character' and the following zero-width extension.
Any idea which files in the dex source tree need to be looked at?
Any idea which files in the dex source tree need to be looked at?
@pepa65
It's in unicode.c
.
After reading the code some more, I'm not sure if anything as elaborate as GLib/custom unicode handling is even required here. The only place where u_is_combining()
is called is in u_char_width()
.
Re-implementing the u_char_width()
function using just wcwidth(3)
seems to improve things a lot:
#define _XOPEN_SOURCE
#include <wchar.h>
int u_char_width(unsigned int u) {
return wcwidth((wchar_t)u);
}
Cursor movement, line lengths and backspace seem to mostly work.
wcwidth(3)
should be available/correct on every platform dex cares about and there's a fairly small, portable implementation available at https://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c if not.
I'd love to hear from @tihirvon any insights he has about what else may need to be fixed, but I'm not sure if he's still interested in working on dex as open source.
That sounds really good @craigbarnes ! Thank you for returning to contribute, I still think dex has a lot of potential.
What is the #define _XOPEN_SOURCE
for?
What is the #define _XOPEN_SOURCE for?
The wcwidth(3)
function is specified in X/Open Portability Guide issue 4 (XPG4). The #define
is needed to expose symbols from that spec.
See: man 3 wcwidth
and man 7 feature_test_macros
.
I've updated https://github.com/pepa65/dex and it seems to work really well! (Now I need to retrieve my old dex config from the old laptop..!)
I just submitted a more comprehensive patch as #42, but it seems cursor movement still isn't fixed for combining chars.