kakoune Arch Linux Arm 2021-11 unable to refresh screen with unknown characters

Question

https://user-images.githubusercontent.com/11535575/170097967-78745e43-4955-4577-aa26-7c5503592f82.mp4

May 24 '22 17:05 nonumeros

LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

Obviously the above is the current locale even before the shown behavior. I'm just posting locale as a reference here.

There are (I think) two issues here and allow me to explain.

But before getting there, this is the same hardware architecture but running instead under archlinuxarm and not under any Ubuntu derivative distro.

I opened an issue a while back in which @Screwtapello kindly explained the locale settings and kakoune's sole objective with regard to modern operating systems and such. He also wrote about modern terminals (I don't think it is necessary to link or outline it here, though). I just wanted to clarify that part.

So the first issue here is more than obvious, and that's the reason for the screencast.

But before getting there. There are two things.

The contents of this file were not even a file at the beginning, just a run of human-readable characters originally yanked through the X system clipboard. Going by that, and if I had to guess, kakoune rightly has nothing to do with these operations: yanking and pasting are done with other tools.

I don't remember for sure whether the pasting was done with xclip specifically, but it is highly likely that it was.

It doesn't happen with, say, xfce4-terminal, but it does happen with modern terminals such as foot-terminal.

https://user-images.githubusercontent.com/11535575/170106105-1e355a52-532d-469c-a551-1fcece2143ab.mp4

May 24 '22 18:05 nonumeros

Before going any further, I can't reproduce it with kak -n

~~Disregard this issue.~~

It's simply unacceptable to me that just having addhl global/ wrap in the config file causes this behavior.

May 24 '22 18:05 nonumeros

I had to scratch off the «disregard this issue» part. There must be something else which I'm unaware of.

With kak -n all the characters appear. They also appear when I comment out addhl global/ wrap. I can record a screencast to confirm it, but it's not really important.

But as soon as addhl global/ wrap is specified in the config file, everything happens as first reported.

May 24 '22 19:05 nonumeros

There must be something else going on. (As I said before), besides kak.

But if something as simple as addhl global/ wrap (don't you think? And that is not a loaded question) is further contributing to it, one might have to think about a re-implementation here. That's my opinion.

I'm now compiling the most recent commit, from a few days ago, and hopefully it will eventually finish. Which it did, just before I finished writing the previous sentence. It shows the same behavior as before.

Comparing the two instances (sessions), one with addhl global/ wrap and the other without it, it seems as if not only \n is causing havoc, but also ۩. In other words, and putting aside (I think) the encoding of the file, this is what the same scenario looks like using the terminal rather than the editor.

Example

The X11 connection broke (error 1). Did the X11 server die?                                                                                                                                       ۩������ ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� �����������                                            
The X11 connection broke (error 1). Did the X11 server die?                                                                                                                                       ۩������ ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� �����������                                            
The X11 connection broke (error 1). Did the X11 server die?                                                                                                                                       ۩������ ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� �����������                                            
The X11 connection broke (error 1). Did the X11 server die?                                                                                                                                       ۩������ ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� �����������                                            
kdeinit5: Fatal IO error: client killed                                                                                                                                                           ۩������ ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� ����������� �����������

The above was with xterm

Now this one with foot-terminal

Exiting due to channel error.                                                                                                                                     ۩                
Exiting due to channel error.                                                                                                                                     ۩                
Exiting due to channel error.                                                                                                                                     ۩                
Exiting due to channel error.                                                                                                                                     ۩                
Exiting due to channel error.

May 24 '22 20:05 nonumeros

Can you write that scratch buffer to disk and attach it to this issue, so we can see what exact bytes are causing problems? Also, is your wrap command literally just addhl global/ wrap or are you adding extra flags like -indent or -marker?

May 25 '22 06:05 Screwtapello

> Can you write that scratch buffer to disk and attach it to this issue, so we can see what exact bytes

How do I do that? Similarly to vim -b <file>?

> adding extra flags like -indent or -marker?

no. Just addhl global/ wrap

May 25 '22 15:05 nonumeros

> How do I do that?

Just :w ~/scratch.txt, click on the "Attach files by dragging & dropping" text below this comment, and your browser should present a file-picker to let you attach the ~/scratch.txt file.

The screenshots you've posted don't seem like they'd be affected by anything Vim's binary mode would help with (in particular, the lack of a newline at the end of the file's last line).

May 25 '22 15:05 Screwtapello

scratch.txt

May 25 '22 15:05 nonumeros

OK, I see what's going on here. Here's one of the lines that's actually causing problems:

$ head -20 scratch.txt | tail -1 | hexdump -C
00000000  54 68 65 20 57 61 79 6c  61 6e 64 20 63 6f 6e 6e  |The Wayland conn|
00000010  65 63 74 69 6f 6e 20 62  72 6f 6b 65 2e 20 44 69  |ection broke. Di|
00000020  64 20 74 68 65 20 57 61  79 6c 61 6e 64 20 63 6f  |d the Wayland co|
00000030  6d 70 6f 73 69 74 6f 72  20 64 69 65 3f 20 20 20  |mpositor die?   |
00000040  20 20 20 20 20 20 20 20  20 20 20 20 20 20 20 20  |                |
*
000000a0  20 20 db a9 fd bf bf bf  bf bf 20 f8 a0 80 82 80  |  ........ .....|
000000b0  fd bf bf bf bf bf 20 f8  a0 80 82 80 fd bf bf bf  |...... .........|
000000c0  bf bf 20 f8 a0 80 82 80  fd bf bf bf bf bf 20 f8  |.. ........... .|
000000d0  a0 80 82 80 fd bf bf bf  bf bf 20 f8 a0 80 82 80  |.......... .....|
000000e0  fd bf bf bf bf bf 20 f8  a0 80 82 80 fd bf bf bf  |...... .........|
000000f0  bf bf 20 f8 a0 80 82 80  fd bf bf bf bf bf 20 f8  |.. ........... .|
00000100  a0 80 82 80 fd bf bf bf  bf bf 20 f8 a0 80 82 80  |.......... .....|
00000110  fd bf bf bf bf bf 20 f8  a0 80 82 80 fd bf bf bf  |...... .........|
00000120  bf bf 20 f8 a0 80 82 80  fd bf bf bf bf bf 20 f8  |.. ........... .|
00000130  a0 80 82 80 fd bf bf bf  bf bf 20 f8 a0 80 82 80  |.......... .....|
00000140  fd bf bf bf bf bf 20 f8  a0 80 82 80 fd bf bf bf  |...... .........|
00000150  bf bf 20 f8 a0 80 82 80  fd bf bf bf bf bf 20 f8  |.. ........... .|
00000160  a0 80 82 80 fd bf bf bf  bf bf 0a                 |...........|
0000016b

There's a bunch of perfectly reasonable ASCII text, then the bytes DB A9 (the UTF-8 encoding of U+06E9, which renders nicely), then an FD byte and 5 BF bytes.

In binary, 0xFD is 0b11111101, which would be the first byte of a 6-byte UTF-8 sequence. 0xBF is 0b10111111, which is a perfectly legitimate UTF-8 continuation byte. Together, this would be the UTF-8 sequence U+7FFFFFFF if Unicode weren't capped at U+10FFFF.

After an ASCII space (0x20) we have F8 A0 80 82 80, or in binary:

11111000 10100000 10000000 10000010 10000000

If we decode that as UTF-8, we get the invalid character U+800080. The rest of the line is various repetitions of these two invalid characters, up until a newline character (0x0A).
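
As a cross-check of the two decodings above, here is a minimal Python sketch of the original 1-to-6-byte UTF-8 layout, the one that predates the U+10FFFF cap. The function name `decode_legacy_utf8` is made up for illustration; it is not something Kakoune or any terminal exposes, and a modern decoder would reject the last two inputs outright.

```python
def decode_legacy_utf8(seq: bytes) -> int:
    """Decode ONE sequence using the old 1-to-6-byte UTF-8 layout.

    Illustration only: this deliberately ignores the modern 4-byte /
    U+10FFFF limit to show what the bytes would have meant without it.
    """
    lead = seq[0]
    if lead < 0x80:
        length, value = 1, lead
    elif 0xC0 <= lead < 0xE0:
        length, value = 2, lead & 0x1F
    elif 0xE0 <= lead < 0xF0:
        length, value = 3, lead & 0x0F
    elif 0xF0 <= lead < 0xF8:
        length, value = 4, lead & 0x07
    elif 0xF8 <= lead < 0xFC:
        length, value = 5, lead & 0x03
    elif 0xFC <= lead < 0xFE:
        length, value = 6, lead & 0x01
    else:
        # 0x80-0xBF are continuation bytes; 0xFE and 0xFF never start a sequence.
        raise ValueError(f"0x{lead:02X} cannot start a sequence")

    if len(seq) != length:
        raise ValueError(f"expected {length} bytes, got {len(seq)}")

    for byte in seq[1:]:
        if byte & 0xC0 != 0x80:
            raise ValueError(f"0x{byte:02X} is not a continuation byte")
        # Each continuation byte contributes its low 6 bits.
        value = (value << 6) | (byte & 0x3F)
    return value


if __name__ == "__main__":
    for hex_bytes in ("dba9", "fdbfbfbfbfbf", "f8a0808280"):
        seq = bytes.fromhex(hex_bytes)
        print(f"{seq.hex(' ').upper()} -> U+{decode_legacy_utf8(seq):04X}")
```

Run on the three byte groups from the hexdump, this prints U+06E9, U+7FFFFFFF and U+800080 respectively; only the first is a valid Unicode code point.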

Here's what I think's going on:

When Kakoune displays a line of text, it estimates how many terminal character cells the text will take up, so it can crop the text at the right-hand column, or leave room for an info box like the g or v menus. Once it sends the line of text to the terminal, the terminal does its own measurement to decide how many character cells the text will take up, and Kakoune doesn't have any control over what the terminal does here; it just has to trust that the terminal will agree with Kakoune's estimation.

For a character like U+0041 LATIN CAPITAL LETTER A, it is well established that it should take exactly one terminal character cell, so it is extremely likely that Kakoune and the terminal will agree.

For a character like U+2500 BOX DRAWINGS LIGHT HORIZONTAL, the Unicode standard says it could be one or two character cells, but usually the value depends on the operating system's currently-active locale, so Kakoune can ask the operating system and assume the terminal will also ask the operating system.
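
The difference between those two characters is visible in the Unicode East Asian Width property, which Python's standard `unicodedata` module exposes. The sketch below is only an illustration of that property: `guess_cells` and its `ambiguous_is_wide` flag are hypothetical stand-ins for whatever the locale would decide, it does not query the operating system, it ignores combining and control characters, and it is not how Kakoune or any particular terminal measures width.

```python
import unicodedata


def guess_cells(ch: str, ambiguous_is_wide: bool = False) -> int:
    """Rough cell-width guess from the East Asian Width class alone.

    Hypothetical helper: `ambiguous_is_wide` stands in for the locale's
    choice; combining and control characters are not handled.
    """
    width_class = unicodedata.east_asian_width(ch)
    if width_class in ("W", "F"):  # Wide / Fullwidth: always two cells
        return 2
    if width_class == "A":  # Ambiguous: one or two cells, context decides
        return 2 if ambiguous_is_wide else 1
    return 1  # Na, N, H: one cell


if __name__ == "__main__":
    for ch in ("A", "\u2500"):
        print(
            f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
            f"class={unicodedata.east_asian_width(ch)} "
            f"cells={guess_cells(ch)} or {guess_cells(ch, ambiguous_is_wide=True)}"
        )
```

U+0041 is classed as narrow, so both guesses agree on one cell; U+2500 is classed as ambiguous, so the answer depends on the flag, which is exactly the part Kakoune and the terminal both have to get from the same source.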

For a "character" like U+7FFFFFFF, there is no well-established convention, there is no record in the operating system's locale database, and it's not even clear how many characters there are here. A system that does the full UTF-8 decode and then checks against the U+10FFFF limit might decide there's one invalid character; a system that first checks against the 4-byte UTF-8 character limit might decide that FD BF BF BF is one invalid character and the following BF BF are two more invalid characters; a system that first notices that FD is an invalid UTF-8 byte might report six independent invalid characters. On top of that, some systems replace each byte of an invalid character with a U+FFFD REPLACEMENT CHARACTER, some replace each invalid character, and some replace each sequence of invalid characters with a single replacement character.
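
To make that disagreement concrete, here is a small Python sketch that counts "characters" in FD BF BF BF BF BF under the policies described above. The helper names (`announced_length`, `count_units`) are invented for this example, and none of the policies is claimed to be what a specific terminal implements; the per-byte count simply uses Python's own `errors="replace"` behaviour.

```python
import re

INVALID = bytes.fromhex("fdbfbfbfbfbf")  # the sequence from the hexdump


def announced_length(lead: int) -> int:
    """Sequence length a lead byte announces under the old 6-byte layout."""
    if 0xC0 <= lead < 0xE0:
        return 2
    if 0xE0 <= lead < 0xF0:
        return 3
    if 0xF0 <= lead < 0xF8:
        return 4
    if 0xF8 <= lead < 0xFC:
        return 5
    if 0xFC <= lead < 0xFE:
        return 6
    return 1  # ASCII, stray continuation bytes, 0xFE, 0xFF


def count_units(data: bytes, max_len: int) -> int:
    """Count lead-byte-plus-continuations groups, capped at max_len bytes."""
    count = i = 0
    while i < len(data):
        want = min(announced_length(data[i]), max_len)
        i += 1
        taken = 1
        # Swallow continuation bytes (10xxxxxx) up to the wanted length.
        while taken < want and i < len(data) and data[i] & 0xC0 == 0x80:
            i += 1
            taken += 1
        count += 1
    return count


# Policy 1: every offending byte becomes its own U+FFFD (Python's decoder).
per_byte = INVALID.decode("utf-8", errors="replace")
# Policy 2: a run of replacement characters collapses into a single one.
collapsed = re.sub("\ufffd+", "\ufffd", per_byte)

counts = {
    "full 6-byte decode, then range check": count_units(INVALID, max_len=6),
    "4-byte cap first": count_units(INVALID, max_len=4),
    "one replacement per invalid byte": per_byte.count("\ufffd"),
    "one replacement per invalid run": collapsed.count("\ufffd"),
}

if __name__ == "__main__":
    for policy, n in counts.items():
        print(f"{n} character(s): {policy}")
```

The same six bytes come out as 1, 3, 6 or 1 characters depending on the policy, so an editor and a terminal that pick different policies will disagree about the cell count.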

As you can see, there's a lot of scope for Kakoune, the OS, and the terminal to violently disagree about how many cells U+7FFFFFFF should take up, and when they do, you get weird behaviour like in your screenshots.

People who want a simpler test case can try this:

printf '\n\n\n\n\n\n\n\n\n\n\n\n\n\xfd\xbf\xbf\xbf\xbf\xbf' | kak -e 'exec g'

in gnome-terminal, it displays as "��" and the goto menu appears correctly, but pressing l on the first character moves directly to the newline afterward
in xterm, it displays as a single blank cell and the goto menu is misaligned; if you move the cursor onto the invalid character, it expands to be 6 blank cells and the menu looks OK, but l moves directly to the newline
in urxvt, it displays as a single "no such character" rectangle and the menu is misaligned; if you move the cursor onto the invalid character, it expands to be "ý¿¿¿¿¿"
in qterminal, it displays as "��" and the goto menu appears no more messed up than it usually does
in pterm it displays as "▒" and expands to "▒▒▒▒▒▒" when you move the cursor onto it
in kitty it displays as zero-width, even when you move the cursor onto it
in terminology it displays as "�" and expands to "��" when you move the cursor onto it

I suppose for maximal predictability, Kakoune needs to be more aggressive about not sending invalid UTF-8 to the terminal, even if it's in the buffer, replacing semantically invalid characters with a single "�" in the same way it replaces syntactically invalid characters (printf '\xfd\xfd' | kak).

In addition, Kakoune should make sure that h and l agree with the display code on what constitutes a "character". It's weird that you can put the cursor on the first "�" in the sequence individually, but l skips over the entire "��" sequence — as opposed to something like a tab character, where l skips over the entire thing, but when you put the cursor on the first cell, the entire thing gets highlighted.

May 25 '22 17:05 Screwtapello
