kakoune
Arch Linux Arm 2021-11 unable to refresh screen with unknown characters
Question
https://user-images.githubusercontent.com/11535575/170097967-78745e43-4955-4577-aa26-7c5503592f82.mp4
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
The above is my current locale, and it was already set this way before the behavior shown. I'm posting the `locale` output here only as a reference.
There are (I think) two issues here, so allow me to explain.
Before getting there: this is the same hardware architecture as before, but running Arch Linux ARM rather than an Ubuntu derivative.
I opened an issue a while back, in which @Screwtapello kindly explained the `locale` settings and what Kakoune aims for on modern operating systems. He also wrote about other modern terminals (I don't think it's necessary to link or outline that here). I just wanted to clarify that part.
So the first issue here is more than obvious, and that's the reason for the screencast.
But before getting there, there are two things.
The contents of this file did not start out as a file at all, but as a run of human-readable characters that were yanked through the X system clipboard. Given that, my guess is that Kakoune, rightly so, has nothing to do with these operations: yanking and pasting are done with other tools.
I don't remember whether the pasting was done specifically with `xclip`, but it's highly likely that it was.
It doesn't happen with, say, `xfce4-terminal`, but it does happen with modern terminals such as `foot-terminal`.
https://user-images.githubusercontent.com/11535575/170106105-1e355a52-532d-469c-a551-1fcece2143ab.mp4
Before going any further: I can't reproduce it with `kak -n`.
~~Disregard this issue.~~
It's simply unacceptable to me that just having `addhl global/ wrap` in the config file causes this behavior.
I had to scratch off the «disregard this issue» part. There must be something else which I'm unaware of.
With `kak -n`, all the characters appear. The same is true when I comment out `addhl global/ wrap`. I can post a screencast to confirm it, but it's not really important.
But as soon as `addhl global/ wrap` is specified in the config file, everything happens as first reported.
As I said before, there must be something else going on besides kak.
But if something as simple as `addhl global/ wrap` is contributing to it (and I don't mean that as a loaded statement), the implementation may need rethinking. That's my opinion.
I'm compiling now the most recent commit from a few days ago, and hopefully, hopefully it will eventually finish. Which it did, just before I finished writing the prior sentence. And it shows the same behavior as before.
Comparing the two sessions, one with `addhl global/ wrap` and one without it, it seems that not only `\n` is causing havoc, but also ۩. In other words, and putting aside (I think) the encoding of the file, in this particular scenario, using the terminal rather than the editor:
Example
The X11 connection broke (error 1). Did the X11 server die? ۩������ ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� �����������
The X11 connection broke (error 1). Did the X11 server die? ۩������ ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� �����������
The X11 connection broke (error 1). Did the X11 server die? ۩������ ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� �����������
The X11 connection broke (error 1). Did the X11 server die? ۩������ ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� �����������
kdeinit5: Fatal IO error: client killed ۩������ ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� �����������
The above was with xterm
Now this one with foot-terminal
Exiting due to channel error. ۩
Exiting due to channel error. ۩
Exiting due to channel error. ۩
Exiting due to channel error. ۩
Exiting due to channel error.
Can you write that scratch buffer to disk and attach it to this issue, so we can see what exact bytes are causing problems? Also, is your wrap command literally just `addhl global/ wrap`, or are you adding extra flags like `-indent` or `-marker`?
> Can you write that scratch buffer to disk and attach it to this issue, so we can see what exact bytes

How do I do that? Similarly to `vim -b <file>`?

> adding extra flags like -indent or -marker?

No. Just `addhl global/ wrap`.
> How do I do that?

Just `:w ~/scratch.txt`, click on the "Attach files by dragging & dropping" text below this comment, and your browser should present a file-picker to let you attach the `~/scratch.txt` file.
The screenshots you've posted don't seem like they'd be affected by anything Vim's binary mode would help with (in particular, the lack of a newline at the end of the file's last line).
OK, I see what's going on here. Here's one of the lines that's actually causing problems:
$ head -20 scratch.txt | tail -1 | hexdump -C
00000000 54 68 65 20 57 61 79 6c 61 6e 64 20 63 6f 6e 6e |The Wayland conn|
00000010 65 63 74 69 6f 6e 20 62 72 6f 6b 65 2e 20 44 69 |ection broke. Di|
00000020 64 20 74 68 65 20 57 61 79 6c 61 6e 64 20 63 6f |d the Wayland co|
00000030 6d 70 6f 73 69 74 6f 72 20 64 69 65 3f 20 20 20 |mpositor die? |
00000040 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 | |
*
000000a0 20 20 db a9 fd bf bf bf bf bf 20 f8 a0 80 82 80 | ........ .....|
000000b0 fd bf bf bf bf bf 20 f8 a0 80 82 80 fd bf bf bf |...... .........|
000000c0 bf bf 20 f8 a0 80 82 80 fd bf bf bf bf bf 20 f8 |.. ........... .|
000000d0 a0 80 82 80 fd bf bf bf bf bf 20 f8 a0 80 82 80 |.......... .....|
000000e0 fd bf bf bf bf bf 20 f8 a0 80 82 80 fd bf bf bf |...... .........|
000000f0 bf bf 20 f8 a0 80 82 80 fd bf bf bf bf bf 20 f8 |.. ........... .|
00000100 a0 80 82 80 fd bf bf bf bf bf 20 f8 a0 80 82 80 |.......... .....|
00000110 fd bf bf bf bf bf 20 f8 a0 80 82 80 fd bf bf bf |...... .........|
00000120 bf bf 20 f8 a0 80 82 80 fd bf bf bf bf bf 20 f8 |.. ........... .|
00000130 a0 80 82 80 fd bf bf bf bf bf 20 f8 a0 80 82 80 |.......... .....|
00000140 fd bf bf bf bf bf 20 f8 a0 80 82 80 fd bf bf bf |...... .........|
00000150 bf bf 20 f8 a0 80 82 80 fd bf bf bf bf bf 20 f8 |.. ........... .|
00000160 a0 80 82 80 fd bf bf bf bf bf 0a |...........|
0000016b
There's a bunch of perfectly reasonable ASCII text, then the bytes DB A9 (the UTF-8 encoding of U+06E9, which renders nicely), then an FD byte and 5 BF bytes.
In binary, 0xFD is 0b11111101, which would be the first byte of a 6-byte UTF-8 sequence. 0xBF is 0b10111111, which is a perfectly legitimate UTF-8 continuation byte. Together, this would be the UTF-8 sequence for U+7FFFFFFF if Unicode weren't capped at U+10FFFF.
After an ASCII space (0x20) we have F8 A0 80 82 80, or in binary:
0b11111000
0b10100000
0b10000000
0b10000010
0b10000000
If we decode that as UTF-8, we get the invalid character U+800080. The rest of the line is various repetitions of these two invalid characters, up until a newline character (0x0A).
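To double-check the arithmetic above, here is a minimal sketch of a decoder for the original 1-to-6-byte UTF-8 layout. The function name `decode_legacy_utf8` is made up for illustration and this is not how Kakoune or any terminal decodes text; it deliberately skips the validity checks (overlong forms, surrogates, the U+10FFFF cap) so that out-of-range values are visible.

```python
def decode_legacy_utf8(seq: bytes) -> int:
    """Decode ONE sequence using the original 1-to-6-byte UTF-8 layout.

    Illustrative only: values above U+10FFFF, surrogates and overlong
    forms are returned as-is instead of being rejected.
    """
    lead = seq[0]
    if lead < 0x80:                      # 0xxxxxxx
        length, value = 1, lead
    elif lead >> 5 == 0b110:             # 110xxxxx
        length, value = 2, lead & 0x1F
    elif lead >> 4 == 0b1110:            # 1110xxxx
        length, value = 3, lead & 0x0F
    elif lead >> 3 == 0b11110:           # 11110xxx
        length, value = 4, lead & 0x07
    elif lead >> 2 == 0b111110:          # 111110xx (not valid in modern UTF-8)
        length, value = 5, lead & 0x03
    elif lead >> 1 == 0b1111110:         # 1111110x (not valid in modern UTF-8)
        length, value = 6, lead & 0x01
    else:
        raise ValueError(f"0x{lead:02X} cannot start a sequence")
    if len(seq) != length:
        raise ValueError(f"expected {length} bytes, got {len(seq)}")
    for byte in seq[1:]:
        if byte >> 6 != 0b10:            # continuation bytes look like 10xxxxxx
            raise ValueError(f"0x{byte:02X} is not a continuation byte")
        value = (value << 6) | (byte & 0x3F)
    return value


# The three sequences discussed above, taken from the hexdump.
for raw in (b"\xdb\xa9", b"\xfd\xbf\xbf\xbf\xbf\xbf", b"\xf8\xa0\x80\x82\x80"):
    print(raw.hex(" ").upper(), "-> U+%04X" % decode_legacy_utf8(raw))
```

This prints U+06E9, U+7FFFFFFF and U+800080 for the three sequences, matching the values worked out by hand.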
Here's what I think's going on:
When Kakoune displays a line of text, it estimates how many terminal character cells the text will take up, so it can crop the text at the right-hand column, or leave room for an info box like the `g` or `v` menus. Once it sends the line of text to the terminal, the terminal does its own measurement to decide how many character cells the text will take up, and Kakoune doesn't have any control over what the terminal does here. It just has to trust that the terminal will agree with Kakoune's estimation.
For a character like U+0041 LATIN CAPITAL LETTER A, it is well established that it should take exactly one terminal character cell, so it is extremely likely that Kakoune and the terminal will agree.
For a character like U+2500 BOX DRAWINGS LIGHT HORIZONTAL, the Unicode standard says it could be one or two character cells, but usually the value depends on the operating system's currently-active locale, so Kakoune can ask the operating system and assume the terminal will also ask the operating system.
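As a rough illustration of the difference between these two cases, Python's `unicodedata` module exposes the Unicode East_Asian_Width property. The helper `guess_cells` below is hypothetical: it is not what Kakoune or any terminal does (they would more likely go through something like the C library's `wcwidth`), and it ignores combining and zero-width characters entirely.

```python
import unicodedata


def guess_cells(ch: str, ambiguous_is_wide: bool = False) -> int:
    """Hypothetical helper: guess a cell width from East_Asian_Width alone."""
    width_class = unicodedata.east_asian_width(ch)
    if width_class in ("W", "F"):        # Wide / Fullwidth
        return 2
    if width_class == "A":               # Ambiguous: depends on the environment
        return 2 if ambiguous_is_wide else 1
    return 1                             # Narrow, Halfwidth, Neutral


for ch in ("A", "\u2500"):
    print(
        f"U+{ord(ch):04X} {unicodedata.name(ch)}:",
        unicodedata.east_asian_width(ch),
        guess_cells(ch),
        guess_cells(ch, ambiguous_is_wide=True),
    )
```

U+0041 is classified as narrow, so both guesses agree on one cell. U+2500 is classified as ambiguous, so the answer depends on a setting that both the editor and the terminal have to get from the same place.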
For a "character" like U+7FFFFFFF, there is no well-established convention, there is no record in the operating system's locale database, and it's not even clear how many characters there are here. A system that does the full UTF-8 decode and then checks against the U+10FFFF limit might decide there's one invalid character; a system that first checks against the 4-byte UTF-8 character limit might decide that FD BF BF BF is one invalid character and the following BF BF are two more invalid characters; a system that first notices that FD is an invalid UTF-8 byte might report six independent invalid characters. On top of that, some systems replace each byte of an invalid character with a U+FFFD REPLACEMENT CHARACTER, some replace each invalid character, and some replace each sequence of invalid characters with a single replacement character.
As you can see, there's a lot of scope for Kakoune, the OS, and the terminal to violently disagree about how many cells U+7FFFFFFF should take up, and when they do, you get weird behaviour like in your screen shots.
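To make the disagreement concrete, here is a small sketch using only Python's standard library. CPython's strict UTF-8 decoder happens to be one of the systems that treats FD as an invalid byte on its own, so `errors="replace"` yields one U+FFFD per byte. Collapsing runs with a regular expression stands in for a system that replaces each sequence of invalid characters with a single replacement character. The `sample` bytes are the two invalid sequences from the hexdump, separated by a space.

```python
import re

REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER

# FD BF BF BF BF BF, a space, then F8 A0 80 82 80.
sample = b"\xfd\xbf\xbf\xbf\xbf\xbf \xf8\xa0\x80\x82\x80"

# Policy 1: one replacement character per invalid byte.
per_byte = sample.decode("utf-8", errors="replace")

# Policy 2: one replacement character per run of invalid input.
per_run = re.sub(REPLACEMENT + "+", REPLACEMENT, per_byte)

print(per_byte.count(REPLACEMENT), "replacement characters with policy 1")
print(per_run.count(REPLACEMENT), "replacement characters with policy 2")
```

The same 12 bytes come out as 11 replacement characters under the first policy and 2 under the second. An editor using one policy and a terminal using the other would disagree about the width of this line by 9 cells, assuming one cell per replacement character.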
People who want a simpler test case can try this:
printf '\n\n\n\n\n\n\n\n\n\n\n\n\n\xfd\xbf\xbf\xbf\xbf\xbf' | kak -e 'exec g'
- in `gnome-terminal`, it displays as "������" and the goto menu appears correctly, but pressing `l` on the first character moves directly to the newline afterward
- in `xterm`, it displays as a single blank cell and the goto menu is misaligned; if you move the cursor onto the invalid character, it expands to be 6 blank cells and the menu looks OK, but `l` moves directly to the newline
- in `urxvt`, it displays as a single "no such character" rectangle and the menu is misaligned; if you move the cursor onto the invalid character, it expands to be "ý¿¿¿¿¿"
- in `qterminal`, it displays as "������" and the goto menu appears no more messed up than it usually does
- in `pterm`, it displays as "▒" and expands to "▒▒▒▒▒▒" when you move the cursor onto it
- in `kitty`, it displays as zero-width, even when you move the cursor onto it
- in `terminology`, it displays as "�" and expands to "������" when you move the cursor onto it
I suppose for maximal predictability, Kakoune needs to be more aggressive about not sending invalid UTF-8 to the terminal, even if it's in the buffer, replacing semantically invalid characters with a single "�" in the same way it replaces syntactically-invalid characters (`printf '\xfd\xfd' | kak`).
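As a sketch of what that kind of sanitizing could look like (in Python rather than Kakoune's C++, and not Kakoune's actual code), the hypothetical `sanitize` function below walks the bytes using the original 1-to-6-byte layout. A sequence that is well-formed in that layout but not valid UTF-8 becomes a single "�"; a byte that doesn't fit the layout at all becomes its own "�".

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER

# (shift, expected prefix) for lead bytes of 1- to 6-byte sequences.
LEAD_PATTERNS = [(7, 0b0), (5, 0b110), (4, 0b1110),
                 (3, 0b11110), (2, 0b111110), (1, 0b1111110)]


def legacy_length(lead: int) -> int:
    """Sequence length implied by a lead byte, or 0 if it cannot be a lead."""
    for length, (shift, pattern) in enumerate(LEAD_PATTERNS, start=1):
        if lead >> shift == pattern:
            return length
    return 0


def sanitize(data: bytes) -> str:
    """Hypothetical sanitizer: text that is safe to send to a terminal."""
    out = []
    i = 0
    while i < len(data):
        length = legacy_length(data[i])
        chunk = data[i:i + length]
        well_formed = (
            length > 0
            and len(chunk) == length
            and all(b >> 6 == 0b10 for b in chunk[1:])
        )
        if not well_formed:
            # Syntactically invalid: replace this one byte and resynchronise.
            out.append(REPLACEMENT)
            i += 1
            continue
        try:
            out.append(chunk.decode("utf-8"))   # strict: rejects > U+10FFFF etc.
        except UnicodeDecodeError:
            # Semantically invalid: the whole sequence becomes one character.
            out.append(REPLACEMENT)
        i += length
    return "".join(out)


print(sanitize(b"\xdb\xa9\xfd\xbf\xbf\xbf\xbf\xbf \xf8\xa0\x80\x82\x80\n"))
```

With this policy, the problem line from the attached file would reach the terminal as "۩� �" followed by a newline, and `printf '\xfd\xfd'` would still produce two replacement characters, since neither FD is followed by five continuation bytes.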
In addition, Kakoune should make sure that `h` and `l` agree with the display code on what constitutes a "character". It's weird that you can put the cursor on the first "�" in the sequence individually, but `l` skips over the entire "������" sequence. Compare that with something like a tab character, where `l` skips over the entire thing, but when you put the cursor on the first cell, the entire thing gets highlighted.