Only one part of unicode character get deleted on backspace in WSL/bash (again)

Open EugZol opened this issue 4 years ago • 7 comments

Environment

Windows:

Platform ServicePack Version      VersionString
-------- ----------- -------      -------------
 Win32NT             10.0.19041.0 Microsoft Windows NT 10.0.19041.0

WSL2:

PS C:\Users\eugzol> bash -li
eugzol@DESKTOP-FAV7PTR:/mnt/c/Users/eugzol $ uname -a
Linux DESKTOP-FAV7PTR 4.19.128-microsoft-standard #1 SMP Tue Jun 23 12:58:10 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Steps to reproduce

eugzol@DESKTOP-FAV7PTR:~ $ cat test.sh
read user_input
echo $user_input | xxd
eugzol@DESKTOP-FAV7PTR:~ $ chmod +x test.sh
eugzol@DESKTOP-FAV7PTR:~ $ ./test.sh
а
00000000: d0d0 b00a                                ....
eugzol@DESKTOP-FAV7PTR:~ $ bash -li test.sh
а
00000000: d0d0 b00a                                ....
eugzol@DESKTOP-FAV7PTR:~ $ echo $LANG
ru_RU.UTF-8

When running the test.sh script above, I enter а (lowercase Cyrillic letter), press Backspace, enter а again, and then press Enter.

The hex representation of the resulting string, as shown above, is d0 d0b0 0a. d0b0 is the correct UTF-8 encoding of а. The leading d0 is the first byte of the а that was entered before pressing Backspace; only its second byte was deleted. The final 0a is the newline.
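To make the byte accounting above concrete, here is a minimal Python sketch that takes the four bytes from the xxd dump and splits them into the expected input and the leftover byte. The variable names are made up for illustration; only the byte values come from the report.

```python
# Bytes copied from the xxd dump in the report: d0 d0 b0 0a
dump = bytes.fromhex("d0d0b00a")

# "а" (U+0430, Cyrillic small letter a) encodes to two bytes in UTF-8.
cyrillic_a = "\u0430".encode("utf-8")

# What the script should have received: one complete "а" plus the newline.
expected = cyrillic_a + b"\n"

# Everything in front of the expected tail is the orphaned lead byte.
leftover = dump[: len(dump) - len(expected)]

# A lenient decode shows the orphan as U+FFFD (the replacement character).
decoded = dump.decode("utf-8", errors="replace")

print(cyrillic_a.hex(), leftover.hex(), repr(decoded))
```

The leftover byte is the lead byte of a two-byte sequence with no continuation byte after it, which is why a strict UTF-8 decoder would reject the string.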

Expected behavior

The hex representation of the resulting string is d0b0 0a. That is, all bytes of the а character are deleted when pressing Backspace.

Actual behavior

Backspace deletes only the last byte of a two-byte character.

EugZol avatar Feb 18 '21 13:02 EugZol

#5057 was closed previously. Recommendations from that issue didn't help.

EugZol avatar Feb 18 '21 13:02 EugZol

Out of curiosity, what's the value of LC_ALL? Sometimes that makes a difference

zadjii-msft avatar Feb 18 '21 15:02 zadjii-msft

Hm, it's empty:

 $ locale
LANG=ru_RU.UTF-8
LANGUAGE=
LC_CTYPE="ru_RU.UTF-8"
LC_NUMERIC="ru_RU.UTF-8"
LC_TIME="ru_RU.UTF-8"
LC_COLLATE="ru_RU.UTF-8"
LC_MONETARY="ru_RU.UTF-8"
LC_MESSAGES="ru_RU.UTF-8"
LC_PAPER="ru_RU.UTF-8"
LC_NAME="ru_RU.UTF-8"
LC_ADDRESS="ru_RU.UTF-8"
LC_TELEPHONE="ru_RU.UTF-8"
LC_MEASUREMENT="ru_RU.UTF-8"
LC_IDENTIFICATION="ru_RU.UTF-8"
LC_ALL=

If I set it manually, the error still reproduces:

eugzol@DESKTOP-FAV7PTR:~ $ export LC_ALL=ru_RU.UTF-8
eugzol@DESKTOP-FAV7PTR:~ $ ./test.sh
а
00000000: d0d0 b00a                                ....
eugzol@DESKTOP-FAV7PTR:~ $ locale
LANG=ru_RU.UTF-8
LANGUAGE=
LC_CTYPE="ru_RU.UTF-8"
LC_NUMERIC="ru_RU.UTF-8"
LC_TIME="ru_RU.UTF-8"
LC_COLLATE="ru_RU.UTF-8"
LC_MONETARY="ru_RU.UTF-8"
LC_MESSAGES="ru_RU.UTF-8"
LC_PAPER="ru_RU.UTF-8"
LC_NAME="ru_RU.UTF-8"
LC_ADDRESS="ru_RU.UTF-8"
LC_TELEPHONE="ru_RU.UTF-8"
LC_MEASUREMENT="ru_RU.UTF-8"
LC_IDENTIFICATION="ru_RU.UTF-8"
LC_ALL=ru_RU.UTF-8

However, LC_ALL doesn't persist after I log back in to bash (even if I restart WSL) when I add it to /etc/default/locale:

eugzol@DESKTOP-FAV7PTR:~ $ cat /etc/default/locale
LANG=ru_RU.UTF-8
LC_ALL=ru_RU.utf8
eugzol@DESKTOP-FAV7PTR:~ $ exit
выход
PS C:\Users\eugzol> bash
eugzol@DESKTOP-FAV7PTR:/mnt/c/Users/eugzol $ locale
LANG=ru_RU.UTF-8
LANGUAGE=
LC_CTYPE="ru_RU.UTF-8"
LC_NUMERIC="ru_RU.UTF-8"
LC_TIME="ru_RU.UTF-8"
LC_COLLATE="ru_RU.UTF-8"
LC_MONETARY="ru_RU.UTF-8"
LC_MESSAGES="ru_RU.UTF-8"
LC_PAPER="ru_RU.UTF-8"
LC_NAME="ru_RU.UTF-8"
LC_ADDRESS="ru_RU.UTF-8"
LC_TELEPHONE="ru_RU.UTF-8"
LC_MEASUREMENT="ru_RU.UTF-8"
LC_IDENTIFICATION="ru_RU.UTF-8"
LC_ALL=

EugZol avatar Feb 18 '21 16:02 EugZol

So, there's one critical issue here ... and that's that we send bash (in any incarnation) a single backspace character (0x08/0x7f depending on encoding). That's by design: all terminals work like this. That's the only way it can be done: since the terminal cannot know how the receiving end will react to a backspace, it can't necessarily guess, for example, that it's going to delete one cell worth of data and send enough backspaces to destroy that cell.
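As a rough illustration of why the receiving end matters, the sketch below models a line editor that gets a single backspace and has to decide how many bytes to drop. This is a hypothetical toy model, not the actual WSL, conhost, or Linux kernel code; the function names and the `utf8_aware` switch are assumptions made for the example. On Linux, the termios IUTF8 input flag is documented as allowing character-erase to work correctly in cooked mode, and the `utf8_aware=True` branch is meant to loosely mimic that kind of behavior.

```python
def erase_last(buf: bytearray, utf8_aware: bool) -> None:
    """Remove the last character from buf in response to one backspace."""
    if not buf:
        return
    if utf8_aware:
        # Drop trailing UTF-8 continuation bytes (bit pattern 10xxxxxx) first.
        while len(buf) > 1 and (buf[-1] & 0xC0) == 0x80:
            buf.pop()
    # Then drop the lead byte (or just the last byte in byte-wise mode).
    buf.pop()


def simulate(keys, utf8_aware: bool) -> bytes:
    """Feed a sequence of key byte strings through the toy line editor."""
    buf = bytearray()
    for key in keys:
        if key in (b"\x7f", b"\x08"):  # DEL or BS, both treated as erase here
            erase_last(buf, utf8_aware)
        else:
            buf += key
    return bytes(buf)


# The key sequence from the report: "а", Backspace, "а", Enter.
a = "\u0430".encode("utf-8")
keys = [a, b"\x7f", a, b"\n"]

print(simulate(keys, utf8_aware=False).hex())  # byte-wise erase
print(simulate(keys, utf8_aware=True).hex())   # UTF-8-aware erase
```

In this model the byte-wise variant reproduces the d0 d0 b0 0a sequence from the report, while the UTF-8-aware variant yields d0 b0 0a. Whether a missing UTF-8 erase mode is the real cause here is an assumption; the sketch only shows that the terminal can send the same single backspace in both cases.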

DHowett avatar Feb 18 '21 18:02 DHowett

However, you're right. There's something odd happening here. Both the Windows Console and Terminal do this differently than urxvt and Konsole on the same system/configuration.

DHowett avatar Feb 18 '21 20:02 DHowett

The same can be seen when using input in ipython

#!/usr/bin/env python3
x = input("> ")
print(x)
print(len(x))

Deleting one Unicode character with Backspace and then typing a replacement results in:

python3 getinput.py
> huhöäüöäxxx
huhöäüöä�xxx
12

This does not happen when using a different terminal, e.g. Terminator, instead of Windows Terminal.
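The reported length of 12 is consistent with a single orphaned lead byte sitting in the input. The sketch below rebuilds such a byte string by hand. Which character was actually erased, and how Python decoded the stray byte, are not stated in the comment, so the `orphan` value and the `errors="replace"` decode are assumptions for illustration only.

```python
# The 8 characters typed before the erased one, and the 3 typed after it.
typed = "huhöäüöä".encode("utf-8")
tail = b"xxx"

# Hypothetical: an "ö" was typed next and Backspace removed only its second
# byte, leaving the lead byte 0xc3 behind.
orphan = "ö".encode("utf-8")[:1]

received = typed + orphan + tail

# Decode leniently so the stray byte becomes U+FFFD instead of raising.
text = received.decode("utf-8", errors="replace")

print(text, len(text))
```

The visible text has 11 characters; the orphaned byte adds one more code point after decoding, which accounts for the count of 12.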

mutax avatar Apr 18 '24 11:04 mutax

However, you're right. There's something odd happening here. Both the Windows Console and Terminal do this differently than urxvt and Konsole on the same system/configuration.

I've looked into this today, and found that WSL indeed only uses GetConsoleInputW and that we return only the expected characters (U+0430 and U+007f) and nothing else. So it should theoretically not be a bug in conhost. I'm not sure how Linux and WSL handle line input in a VM like that, but I feel like it's not unlikely that this is a bug in WSL.

lhecker avatar Apr 18 '24 16:04 lhecker