iota
Fix unicode support
Since #50 was merged, Unicode support is broken. @P1start mentioned in the comments that fixing this shouldn't be too involved.
There are actually two issues:
- Showing buffer content as ISO8859-1 text (showing one character per byte?)
- Misinterpreting unicode input as ASCII characters. For example, йцукенгшщзхъфывапролджэ\ячсмитьбю. is interpreted as 9FC:5=3HI7EJDK20?@>;46M\OGA<8BL1N.
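For what it's worth, both symptoms can be reproduced in a few lines. This is only a sketch: the function names are made up for illustration, and I have not confirmed where in the input or rendering path this actually happens. Casting each UTF-8 byte to a `char` gives the one-character-per-byte display, and keeping only the low 8 bits of each code point turns the Cyrillic example into exactly the ASCII string reported above.

```rust
/// Hypothetical helper: read every UTF-8 byte as its own character,
/// which is what an ISO 8859-1 style display would do.
fn bytes_as_latin1(s: &str) -> String {
    s.bytes().map(|b| b as char).collect()
}

/// Hypothetical helper: keep only the low 8 bits of each code point,
/// e.g. 'й' (U+0439) becomes 0x39, which is '9'.
fn truncate_to_low_byte(s: &str) -> String {
    s.chars().map(|c| (c as u32 as u8) as char).collect()
}
```

If that is what is going on, the input bug would be a truncating cast of the key's code point somewhere between the terminal library and the editor, but that is a guess based on the output alone.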
#95 (specifically https://github.com/crespyl/iota/commit/e643737a449851aa068f9e7a5fca8a528d7181b5) has some changes that should hopefully fix unicode rendering (it seems to work for the minimal cases in the buffer.rs tests section). I'm not sure what to do about input; does termbox work with unicode in the first place, and might we need to fix rustbox?
From @crespyl on Gitter:
due to the nature of UTF-8, the nth char in a buffer is not necessarily at the nth byte; it should be possible to use something like
self.chars().indices().take(n).last().map(|(byte_index, character)| byte_index)
to correctly handle multi-byte characters
Related to cursor movement over multi-byte characters.
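A minimal sketch of that idea in current Rust, where the iterator is `str::char_indices`. The function name is hypothetical. As far as I can tell, `take(n).last()` in the snippet above would land on character n-1, so this version uses `nth(n)` with a 0-based index instead.

```rust
/// Byte offset at which the `n`th character (0-based) of `s` starts,
/// or `None` if `s` has `n` or fewer characters.
/// (Hypothetical helper, not part of the existing code.)
fn byte_offset_of_char(s: &str, n: usize) -> Option<usize> {
    s.char_indices().nth(n).map(|(byte_index, _)| byte_index)
}
```

For a string like "aйb" the offsets are 0, 1 and 3, because й takes two bytes in UTF-8. Note that this counts `char`s (code points), not graphemes, so a cursor built on it would still stop in the middle of a base character plus combining mark.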
I've been messing around with trying to add unicode support, and it is turning out to be complicated. The biggest problem I have found is that termbox expects each cell to be a single codepoint, even though there sometimes needs to be multiple codepoints per cell. It probably wouldn't be too hard to modify termbox to store each cell as an array of chars rather than a single char, although it would take away some of the simplicity of the library. And, of course, UIBuffer would have to do this as well.
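To make the multiple-codepoints-per-cell problem concrete, here is a rough sketch of a cell that stores a short string instead of one `char`. The `Cell` type and `to_cells` are hypothetical and are not termbox's or UIBuffer's actual layout, and the combining-mark check is a deliberate simplification that only knows about U+0300..=U+036F.

```rust
/// Hypothetical cell that can hold more than one code point.
#[derive(Debug, PartialEq)]
struct Cell {
    text: String,
}

/// Assumption: only the Combining Diacritical Marks block counts as
/// "combining". Real text needs proper grapheme segmentation.
fn is_combining(c: char) -> bool {
    ('\u{0300}'..='\u{036F}').contains(&c)
}

/// Group code points into cells, attaching combining marks to the
/// cell of the preceding base character.
fn to_cells(s: &str) -> Vec<Cell> {
    let mut cells: Vec<Cell> = Vec::new();
    for c in s.chars() {
        if is_combining(c) {
            if let Some(cell) = cells.last_mut() {
                cell.text.push(c);
                continue;
            }
        }
        cells.push(Cell { text: c.to_string() });
    }
    cells
}
```

With this, "e" followed by U+0301 and then "x" is three `char`s but only two cells, which is the case a one-codepoint-per-cell layout cannot represent.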
I think that some problems could be solved by using iterators over cells (where 1 cell = 1 character width) rather than over bytes, chars, or graphemes. For example, an iterator yielding Option<&str> which, for each grapheme, yields the grapheme first and then yields None for each extra character-width the grapheme takes up.
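Roughly what I have in mind, as a std-only sketch. Two loud assumptions: each `char` stands in for a grapheme here, and `display_width` is a crude placeholder that only treats a few East Asian ranges as two columns wide. A real version would need proper grapheme segmentation and a real width table. `Cells` and `cells` are made-up names.

```rust
/// Crude placeholder for a display-width lookup (assumption: only
/// these ranges are two columns wide, everything else is one).
fn display_width(c: char) -> usize {
    match c as u32 {
        0x1100..=0x115F
        | 0x2E80..=0xA4CF
        | 0xAC00..=0xD7A3
        | 0xF900..=0xFAFF
        | 0xFF00..=0xFF60 => 2,
        _ => 1,
    }
}

/// Hypothetical iterator over terminal cells: yields `Some(text)` for
/// the first column of each unit, then `None` for each extra column.
struct Cells<'a> {
    text: &'a str,
    pos: usize,     // byte offset of the next unit
    padding: usize, // `None` cells still owed for the previous unit
}

impl<'a> Iterator for Cells<'a> {
    type Item = Option<&'a str>;

    fn next(&mut self) -> Option<Self::Item> {
        if self.padding > 0 {
            self.padding -= 1;
            return Some(None);
        }
        let text: &'a str = self.text;
        let c = text[self.pos..].chars().next()?;
        let start = self.pos;
        self.pos += c.len_utf8();
        self.padding = display_width(c).saturating_sub(1);
        Some(Some(&text[start..self.pos]))
    }
}

fn cells(text: &str) -> Cells<'_> {
    Cells { text, pos: 0, padding: 0 }
}
```

The nice property is that the number of items yielded equals the number of columns used, so the renderer can zip the iterator with screen columns without knowing anything about bytes.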
I'm guessing it would be easiest to have Buffer be an abstraction layer for all the byte-level stuff and let every other part of the code deal in characters/graphemes. This would, of course, require heavy changes to the interface of Buffer... but so would changing the data structure backing it, which might inevitably happen anyways.
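As a sketch of what such an interface could look like: `CharBuffer` below is a hypothetical stand-in backed by a plain `String`, not iota's actual `Buffer` or its backing structure. Callers only ever pass character indices, and the byte offsets stay private.

```rust
/// Hypothetical buffer whose public API is indexed by character,
/// with all byte-offset handling kept internal.
struct CharBuffer {
    text: String,
}

impl CharBuffer {
    fn new(text: &str) -> Self {
        CharBuffer { text: text.to_string() }
    }

    /// Byte offset of character `char_idx`; the end of the text is a
    /// valid position so that appending works.
    fn byte_offset(&self, char_idx: usize) -> Option<usize> {
        self.text
            .char_indices()
            .map(|(i, _)| i)
            .chain(std::iter::once(self.text.len()))
            .nth(char_idx)
    }

    /// Insert `c` before character `char_idx`; false if out of range.
    fn insert(&mut self, char_idx: usize, c: char) -> bool {
        match self.byte_offset(char_idx) {
            Some(i) => {
                self.text.insert(i, c);
                true
            }
            None => false,
        }
    }

    /// Remove and return character `char_idx`, if there is one.
    fn remove(&mut self, char_idx: usize) -> Option<char> {
        let i = self.byte_offset(char_idx)?;
        if i == self.text.len() {
            return None;
        }
        Some(self.text.remove(i))
    }

    fn len_chars(&self) -> usize {
        self.text.chars().count()
    }

    fn as_str(&self) -> &str {
        &self.text
    }
}
```

Every operation here rescans from the start of the text, which is fine for a sketch but is one more reason the backing data structure would probably have to change.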
In summary, it seems like an implementation of unicode support could start from two places: termbox and Buffer.
Fixing the display of @suhr's example text wasn't too hard, but the fix shows why it is probably important not to make code outside of Buffer deal with data on the byte level.