
Fix unicode support

Open gchp opened this issue 9 years ago • 4 comments

Since #50 was merged, Unicode support is broken. @P1start mentioned in the comments that fixing this shouldn't be too involved.

gchp avatar Dec 29 '14 18:12 gchp

There are actually two issues:

  • Showing buffer content as ISO8859-1 text (showing one character per byte?)
  • Misinterpreting unicode input as ASCII characters. For example, йцукенгшщзхъфывапролджэ\ячсмитьбю. is interpreted as 9FC:5=3HI7EJDK20?@>;46M\OGA<8BL1N..
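Both symptoms can be reproduced in a few lines of plain Rust. This is only a sketch of what the observed behaviour is consistent with, not a claim about what iota, rustbox, or termbox actually do internally; the function names `latin1_view` and `truncate_to_low_byte` are made up for illustration. The first function shows one character per byte, as in the display issue. The second keeps only the low byte of each code point, which happens to map the Cyrillic example above to exactly the reported ASCII string.

```rust
/// Hypothetical helper: renders UTF-8 bytes as if each byte were one
/// ISO 8859-1 character (`u8 as char` maps a byte to U+0000..=U+00FF).
fn latin1_view(s: &str) -> String {
    s.bytes().map(|b| b as char).collect()
}

/// Hypothetical helper: keeps only the low 8 bits of every code point.
/// Assumption: something like this happens on the input path; the thread
/// does not say where the truncation actually occurs.
fn truncate_to_low_byte(s: &str) -> String {
    s.chars().map(|c| (c as u32 as u8) as char).collect()
}
```

For instance, `latin1_view("й")` yields two characters (U+00D0 and U+00B9) because "й" is encoded as the two bytes 0xD0 0xB9, while `truncate_to_low_byte("й")` yields "9" because U+0439 has 0x39 as its low byte.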

suhr avatar Feb 16 '15 14:02 suhr

#95 (specifically https://github.com/crespyl/iota/commit/e643737a449851aa068f9e7a5fca8a528d7181b5) has some changes that should hopefully fix unicode rendering (it seems to work for the minimal cases in the buffer.rs tests section). I'm not sure what to do about input; does termbox work with unicode in the first place, and might we need to fix rustbox?

crespyl avatar Feb 16 '15 22:02 crespyl

From @crespyl on Gitter:

Due to the nature of UTF-8, the nth char in a buffer is not necessarily at the nth byte. It should be possible to use something like self.chars().indices().take(n).last().map(|(byte_index, character)| byte_index) to correctly handle multi-byte characters.

Related to cursor movement over multi-byte characters.
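A minimal sketch of the char-index-to-byte-index lookup described in the quote, using only the standard library. Two things differ from the quoted snippet and are my reading rather than anything confirmed in the thread: current Rust spells the iterator `str::char_indices()`, and `nth(n)` is used instead of `take(n).last()`, since the latter would return the entry just before index n. The function name `byte_index_of_char` is hypothetical.

```rust
/// Returns the byte offset at which the `n`-th char (0-based) of `s` starts,
/// or `None` if `s` has `n` or fewer chars.
fn byte_index_of_char(s: &str, n: usize) -> Option<usize> {
    s.char_indices().nth(n).map(|(byte_index, _character)| byte_index)
}
```

With the string "aйb", char 2 ("b") starts at byte 3 rather than byte 2, because "й" occupies two bytes. Note that this walks chars, not graphemes, so a cursor built on it can still land between a base character and a combining mark.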

gchp avatar Feb 23 '15 12:02 gchp

I've been messing around with trying to add unicode support, and it is turning out to be complicated. The biggest problem I have found is that termbox expects each cell to be a single codepoint, even though a cell sometimes needs to hold multiple codepoints (for example, a base character followed by combining marks). It probably wouldn't be too hard to modify termbox to store each cell as an array of chars rather than a single char, although it would take away some of the simplicity of the library. And, of course, UIBuffer would have to do the same.

I think that some problems could be solved by using iterators over cells (where 1 cell = 1 character width) rather than over bytes, chars, or graphemes. For example, an iterator yielding Option<&str> which, for each grapheme, yields the grapheme first and then yields None for each extra character-width the grapheme takes up.
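The cell iterator idea could look roughly like the sketch below. To stay within the standard library it assumes the caller has already split the text into graphemes and measured each one's display width (in practice that would come from a Unicode segmentation and width library, which is not shown here). The function name `cells` and the `(grapheme, width)` input shape are assumptions made for this example.

```rust
use std::iter;

/// Expands pre-segmented graphemes into terminal cells.
/// Each `(grapheme, width)` pair yields `Some(grapheme)` for its first cell
/// and `None` for every additional column it occupies.
/// Assumption: a width of 0 is still given one cell here.
fn cells<'a>(
    graphemes: &'a [(&'a str, usize)],
) -> impl Iterator<Item = Option<&'a str>> + 'a {
    graphemes.iter().flat_map(|&(grapheme, width)| {
        iter::once(Some(grapheme))
            .chain(iter::repeat(None).take(width.saturating_sub(1)))
    })
}
```

Feeding it `[("a", 1), ("世", 2), ("b", 1)]` produces four items, with a `None` placeholder after the double-width character, so the item index always equals the screen column.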

I'm guessing it would be easiest to have Buffer be an abstraction layer for all the byte-level stuff and let every other part of the code deal in characters/graphemes. This would, of course, require heavy changes to the interface of Buffer... but so would changing the data structure backing it, which might inevitably happen anyways.
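As a rough illustration of that layering, here is a toy buffer whose public methods take char indices and keep all byte arithmetic private. It is not iota's actual Buffer: the type `CharBuffer`, its methods, and the plain `String` backing store are all hypothetical, and it works on chars rather than graphemes to keep the sketch within the standard library.

```rust
/// Hypothetical char-indexed buffer; callers never see byte offsets.
struct CharBuffer {
    text: String,
}

impl CharBuffer {
    fn new(text: &str) -> Self {
        CharBuffer { text: text.to_string() }
    }

    /// Number of chars (not bytes) in the buffer.
    fn len_chars(&self) -> usize {
        self.text.chars().count()
    }

    /// Byte offset of `char_idx`; the end of the text counts as a valid
    /// position so that appending works.
    fn byte_offset(&self, char_idx: usize) -> Option<usize> {
        self.text
            .char_indices()
            .map(|(i, _)| i)
            .chain(std::iter::once(self.text.len()))
            .nth(char_idx)
    }

    /// Inserts `ch` before the char at `char_idx`; returns false if out of range.
    fn insert_char(&mut self, char_idx: usize, ch: char) -> bool {
        match self.byte_offset(char_idx) {
            Some(i) => {
                self.text.insert(i, ch);
                true
            }
            None => false,
        }
    }

    /// Removes and returns the char at `char_idx`, if there is one.
    fn remove_char(&mut self, char_idx: usize) -> Option<char> {
        let i = self.byte_offset(char_idx)?;
        if i == self.text.len() {
            return None;
        }
        Some(self.text.remove(i))
    }
}
```

Because only `byte_offset` knows about bytes, swapping the `String` for a different backing structure later would not change the callers.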

In summary, it seems like an implementation of unicode support could start from two places: termbox and Buffer.

Fixing the display of @suhr's example text wasn't too hard, but the fix shows why it is probably important not to make code outside of Buffer deal with data on the byte level.

[see spaghetti code here] [see screen shot here]

ghost avatar May 13 '15 03:05 ghost