
don't garble partial multi-byte character after control sequence

Open Dieken opened this issue 1 year ago • 2 comments

When using lf to list files, emacs-libvterm may read a partial multi-byte character, for example:

$ echo -n '招聘' | hexdump -C
00000000  e6 8b 9b e8 81 98

; get "招", control sequence and partial character
(vterm--filter process "\xE6\x8B\x9B\e[14;111H\xE8")

; now full "聘"
(vterm--filter process "\x81\x98")

This sends "\xE8" to libvterm on its own, and a lone "\xE8" is not a complete character.
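
To make the split point concrete, here is a minimal C sketch (not vterm's code; `utf8_incomplete_tail` is a hypothetical helper name) that counts how many bytes at the end of a chunk start a UTF-8 character without finishing it. It only looks at the length announced by the lead byte and does not otherwise validate the input.

```c
#include <stddef.h>

/* Number of bytes at the end of buf[0..len) that begin a UTF-8 sequence
   but do not finish it. Returns 0 when the buffer ends on a character
   boundary, or when no usable lead byte is found. */
size_t utf8_incomplete_tail(const unsigned char *buf, size_t len)
{
    /* A UTF-8 sequence is at most 4 bytes long, so skip back over at
       most 3 continuation bytes (10xxxxxx) to reach the lead byte. */
    size_t back = 0;
    while (back < len && back < 3 && (buf[len - 1 - back] & 0xC0) == 0x80)
        back++;
    if (back == len)
        return 0;                   /* only continuation bytes, no lead */

    unsigned char lead = buf[len - 1 - back];
    size_t need;                    /* total length announced by the lead */
    if ((lead & 0x80) == 0x00)      need = 1;
    else if ((lead & 0xE0) == 0xC0) need = 2;
    else if ((lead & 0xF0) == 0xE0) need = 3;
    else if ((lead & 0xF8) == 0xF0) need = 4;
    else return 0;                  /* not a valid lead byte */

    size_t have = back + 1;         /* bytes of that character seen so far */
    return have < need ? have : 0;
}
```

For the first chunk above this returns 1 (the lone "\xE8"), so that byte should be held back rather than passed on.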

Dieken avatar Jun 26 '24 17:06 Dieken

This looks good as far as I can tell (so I'd go ahead and merge it), but can you please explain a little bit more what's happening here?

Sbozzolo avatar Jun 26 '24 17:06 Sbozzolo

> This looks good as far as I can tell (so I'd go ahead and merge it), but can you please explain a little bit more what's happening here?

(vterm--filter process "\xE6\x8B\x9B\e[14;111H\xE8")
(vterm--filter process "\x81\x98")

These two calls translate to the following calls:

; write "招", ok
(vterm--write-input vterm--term (decode-coding-string "\xE6\x8B\x9B" locale-coding-system t))

; move cursor position to line 14 column 111, ok
(vterm--write-input vterm--term "\e[14;111H")

; write UTF-8 encoded string "\xC0\xE8", BAD
(vterm--write-input vterm--term (decode-coding-string "\xE8" locale-coding-system t))

; write UTF-8 encoded string "\xC0\x81\xC0\x98", BAD
(vterm--write-input vterm--term (decode-coding-string "\x81\x98" locale-coding-system t))

The last two calls to vterm--write-input:

Fvterm_write_input(env, nargs, args, data)
    len = string_bytes(env, args[1]);
        env->copy_string_contents(env, args[1], NULL, &size);
            module_copy_string_contents(env, args[1], NULL, len);
                lisp_str_utf8 = encode_string_utf_8(lisp_str, Qnil, true, Qnil /* HANDLE-8-BIT */, Qnil);

    /* len is 0 now !!! */

    env->copy_string_contents(env, args[1], bytes, &len);

    vterm_input_write(term->vt, bytes, len);   // zero bytes !!!

Because HANDLE-8-BIT is Qnil, encode_string_utf_8 returns NULL for "\xC0\xE8" and "\xC0\x81\xC0\x98", so the character "聘" (\xE8\x81\x98) is thrown away.

This patch buffers the trailing "\xE8" until the next call to vterm--filter, so that it can be joined with the following bytes to form a complete UTF-8 character. The original code already tries to handle partial multi-byte characters, but it has an off-by-one error.
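
The buffering idea can be sketched in C as follows. This is only an illustration under my own assumptions, not the patch itself (the real fix lives in the Emacs Lisp vterm--filter); `utf8_carry`, `utf8_carry_feed` and `incomplete_tail` are hypothetical names, and the whole chunk is treated as a plain byte stream with no special handling of control sequences.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical carry-over state: at most 3 bytes of an unfinished character. */
typedef struct {
    unsigned char pending[3];
    size_t npending;
} utf8_carry;

/* Bytes at the end of buf that begin, but do not complete, a UTF-8
   character. Lead bytes are not validated beyond their length marker. */
static size_t incomplete_tail(const unsigned char *buf, size_t len)
{
    for (size_t have = 1; have <= 3 && have <= len; have++) {
        unsigned char b = buf[len - have];
        if ((b & 0xC0) == 0x80)
            continue;               /* continuation byte, keep looking back */
        size_t need = b >= 0xF0 ? 4 : b >= 0xE0 ? 3 : b >= 0xC0 ? 2 : 1;
        return have < need ? have : 0;
    }
    return 0;
}

/* Prepend the bytes held back by the previous call, copy everything that
   ends on a character boundary to `out`, and hold back a trailing partial
   character for the next call. `out` must have room for
   carry->npending + inlen bytes. Returns the number of bytes in `out`. */
size_t utf8_carry_feed(utf8_carry *carry, const unsigned char *in,
                       size_t inlen, unsigned char *out)
{
    size_t total = carry->npending;
    memcpy(out, carry->pending, carry->npending);
    memcpy(out + total, in, inlen);
    total += inlen;

    size_t tail = incomplete_tail(out, total);   /* always 0..3 */
    memcpy(carry->pending, out + total - tail, tail);
    carry->npending = tail;
    return total - tail;
}
```

Feeding the two chunks from the example, the first call emits 12 bytes and keeps "\xE8" pending; the second call emits the complete "\xE8\x81\x98".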

Dieken avatar Jun 27 '24 06:06 Dieken

@Sbozzolo @jixiuf could you merge this?

Dieken avatar Jul 04 '24 02:07 Dieken