dex icon indicating copy to clipboard operation
dex copied to clipboard

Strange editor behaviour with UTF-8 text

Open craigbarnes opened this issue 10 years ago • 16 comments

I'm not sure exactly what the problem is, but certain UTF-8 characters cause the editor to behave strangely. A simple way to reproduce the problem is to insert สี at the beginning of a line, then add a few ASCII characters to the right, move into the middle of those character and press backspace a few times. This deletes the character 1 to the left of the expected character. An extra space also appears at the end of the line that was never inserted.

For each extra สี added to the beginning of the line, the backsapce offset moves 1 to the left and 1 more mysterious space is added at the end of the line.

I guess this is an error calculating display width somewhere. The example character I used above is a 3 byte character (; UCS4: U+0E2A; UTF-8: 0xE0 0xB8 0xAA), followed by a 3 byte combining character (UCS4: U+0E35; UTF-8: 0xE0 0xB8 0xB5), that together should occupy 1 column.

craigbarnes avatar Oct 04 '14 21:10 craigbarnes

Just for testing purposes, I tried adding 0x0e35 to be recognized as a combining character in u_is_combining(). That seems to fix the problems I descibed above but reveals a few other quirks when navigating or deleting the characters.

craigbarnes avatar Oct 14 '14 12:10 craigbarnes

There's no combining character support yet (only a hack in u_char_width() which makes the situation tolerable). Fixing this could be too much work. Also the tables in unicode.c should be generated, not hardcoded.

tihirvon avatar Nov 02 '14 15:11 tihirvon

Yes, it seems to only happen with Thai characters, so same as #34 . Some editors do this correctly, notably JED. jupp, vim, nano, pico, le and joe. Others have the same problem as dex, like e3 and diakonos,

I hope you will be able to fix this, as dex has a lot of potential, being not as byzantine as JED, jupp and joe (and vim) but way more capable than nano and pico.

pepa65 avatar Nov 18 '15 11:11 pepa65

Yes, it seems to only happen with Thai characters, so same as #34

I don't think it's only Thai characters -- it seems to be combining characters in general. In the UTF-8 demo file you mentioned in issue #34, line 57 also has the same problem.

craigbarnes avatar Nov 19 '15 06:11 craigbarnes

Right! So it's any combining characters.

pepa65 avatar Nov 19 '15 10:11 pepa65

@tihirvon I think the DerivedCombiningClass.txt file from the official Unicode 8 character database has all the information needed. It's only ~150KB, whereas the full UCD is ~1.6MB. It seems fairly easy to parse with AWK or a small C program.

It seems that any character in the Grapheme_Extend category should be considered "zero-width" for the purposes of cursor positioning. GLib has a g_unichar_iszerowidth() function that returns true for any character in this category.

I'm willing to work on this issue, but I have a few questions before I start...

  • Is it ok to commit DerivedCombiningClass.txt to the repo?
  • Is it ok to use (portable) AWK for parsing the database and generating the lookup table?
  • How much more work do you think there is to do, besides fixing/replacing u_char_width()?

craigbarnes avatar Dec 01 '15 15:12 craigbarnes

Fantastic initiative for trying to tackle this!! Within the github ecosystem it should be easy to make a new branch and try things out, right?

pepa65 avatar Dec 03 '15 16:12 pepa65

Indeed it seems that (currently) only characters marked Mn are zero-width, in the current list only 470 characters. But if the GLib function g_unichar_iszerowidth() doesn't have too much overhead, it makes sense to use it.

For Backspace-deleting these, the unicode character gets removed, but the cursor stays rendered in the same place. For (position) Deleting, the zerowidth character plus the preceding non-zerowidth 'carrier character' get removed.

pepa65 avatar Dec 03 '15 17:12 pepa65

There's also the issue of selections. It shouldn't be allowed to split a Grapheme_Base character from it's adjacent Grapheme_Extend characters, e.g. via a select and cut operation. Then there's the issue of:

not all terminals support zero-width rendering of zero-width marks

... as mentioned in the GLib docs.

craigbarnes avatar Dec 03 '15 17:12 craigbarnes

I guess for terminals that don't support zero-width rendering of zero-width marks this whole unicode rendering business is irrelevant, and that means for dex this isn't an issue that we can deal with.

Indeed, selections need to always work on the 'rendered (combined) character' level. That's also what I was trying to refer to with "(position) Deleting", the selection of a combined character includes both the 'carrier character' and the following zero-width extension.

pepa65 avatar Dec 03 '15 17:12 pepa65

Any idea which files in the dex source tree need to be looked at?

pepa65 avatar Jan 11 '16 01:01 pepa65

Any idea which files in the dex source tree need to be looked at?

@pepa65

It's in unicode.c.

After reading the code some more, I'm not sure if anything as elaborate as GLib/custom unicode handling is even required here. The only place where u_is_combining() is called is in u_char_width().

Re-implementing the u_char_width() function using just wcwidth(3) seems to improve things a lot:

#define _XOPEN_SOURCE
#include <wchar.h>

int u_char_width(unsigned int u) {
    return wcwidth((wchar_t)u);
}

Cursor movement, line lengths and backspace seem to mostly work.

wcwidth(3) should be available/correct on every platform dex cares about and there's a fairly small, portable implementation available at https://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c if not.

I'd love to hear from @tihirvon any insights he has about what else may need to be fixed, but I'm not sure if he's still interested in working on dex as open source.

craigbarnes avatar Jul 21 '17 16:07 craigbarnes

That sounds really good @craigbarnes ! Thank you for returning to contribute, I still think dex has a lot of potential. What is the #define _XOPEN_SOURCE for?

pepa65 avatar Jul 21 '17 17:07 pepa65

What is the #define _XOPEN_SOURCE for?

The wcwidth(3) function is specified in X/Open Portability Guide issue 4 (XPG4). The #define is needed to expose symbols from that spec.

See: man 3 wcwidth and man 7 feature_test_macros.

craigbarnes avatar Jul 21 '17 18:07 craigbarnes

I've updated https://github.com/pepa65/dex and it seems to work really well! (Now I need to retrieve my old dex config from the old laptop..!)

pepa65 avatar Jul 21 '17 18:07 pepa65

I just submitted a more comprehensive patch as #42, but it seems cursor movement still isn't fixed for combining chars.

craigbarnes avatar Jul 22 '17 01:07 craigbarnes