Only one part of a Unicode character gets deleted on backspace in WSL/bash (again)
Environment
Windows:
Platform ServicePack Version VersionString
-------- ----------- ------- -------------
Win32NT 10.0.19041.0 Microsoft Windows NT 10.0.19041.0
WSL2:
PS C:\Users\eugzol> bash -li
eugzol@DESKTOP-FAV7PTR:/mnt/c/Users/eugzol $ uname -a
Linux DESKTOP-FAV7PTR 4.19.128-microsoft-standard #1 SMP Tue Jun 23 12:58:10 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Steps to reproduce
eugzol@DESKTOP-FAV7PTR:~ $ cat test.sh
read user_input
echo $user_input | xxd
eugzol@DESKTOP-FAV7PTR:~ $ chmod +x test.sh
eugzol@DESKTOP-FAV7PTR:~ $ ./test.sh
а
00000000: d0d0 b00a ....
eugzol@DESKTOP-FAV7PTR:~ $ bash -li test.sh
а
00000000: d0d0 b00a ....
eugzol@DESKTOP-FAV7PTR:~ $ echo $LANG
ru_RU.UTF-8
When running the test.sh script above, I enter а (a lowercase Cyrillic letter), press Backspace, enter а again, and then press Enter.
The hex representation of the resulting string, as shown above, is d0 d0b0 0a. d0b0 is the correct UTF-8 representation of а. The first d0 is the first byte of the а that was entered before pressing Backspace; its second byte was deleted. The final 0a is the newline.
Expected behavior
The hex representation of the resulting string is d0b0 0a. That is, all bytes of the а symbol are deleted when pressing Backspace.
Actual behavior
Backspace deletes only the last byte of a two-byte symbol.
#5057 was closed previously. Recommendations from that issue didn't help.
Out of curiosity, what's the value of LC_ALL? Sometimes that makes a difference.
Hm, it's empty:
$ locale
LANG=ru_RU.UTF-8
LANGUAGE=
LC_CTYPE="ru_RU.UTF-8"
LC_NUMERIC="ru_RU.UTF-8"
LC_TIME="ru_RU.UTF-8"
LC_COLLATE="ru_RU.UTF-8"
LC_MONETARY="ru_RU.UTF-8"
LC_MESSAGES="ru_RU.UTF-8"
LC_PAPER="ru_RU.UTF-8"
LC_NAME="ru_RU.UTF-8"
LC_ADDRESS="ru_RU.UTF-8"
LC_TELEPHONE="ru_RU.UTF-8"
LC_MEASUREMENT="ru_RU.UTF-8"
LC_IDENTIFICATION="ru_RU.UTF-8"
LC_ALL=
If I set it manually, the error reproduces:
eugzol@DESKTOP-FAV7PTR:~ $ export LC_ALL=ru_RU.UTF-8
eugzol@DESKTOP-FAV7PTR:~ $ ./test.sh
а
00000000: d0d0 b00a ....
eugzol@DESKTOP-FAV7PTR:~ $ locale
LANG=ru_RU.UTF-8
LANGUAGE=
LC_CTYPE="ru_RU.UTF-8"
LC_NUMERIC="ru_RU.UTF-8"
LC_TIME="ru_RU.UTF-8"
LC_COLLATE="ru_RU.UTF-8"
LC_MONETARY="ru_RU.UTF-8"
LC_MESSAGES="ru_RU.UTF-8"
LC_PAPER="ru_RU.UTF-8"
LC_NAME="ru_RU.UTF-8"
LC_ADDRESS="ru_RU.UTF-8"
LC_TELEPHONE="ru_RU.UTF-8"
LC_MEASUREMENT="ru_RU.UTF-8"
LC_IDENTIFICATION="ru_RU.UTF-8"
LC_ALL=ru_RU.UTF-8
However, LC_ALL doesn't persist after I log in to bash again (even if I restart WSL) when I try to add it to /etc/default/locale:
eugzol@DESKTOP-FAV7PTR:~ $ cat /etc/default/locale
LANG=ru_RU.UTF-8
LC_ALL=ru_RU.utf8
eugzol@DESKTOP-FAV7PTR:~ $ exit
выход
PS C:\Users\eugzol> bash
eugzol@DESKTOP-FAV7PTR:/mnt/c/Users/eugzol $ locale
LANG=ru_RU.UTF-8
LANGUAGE=
LC_CTYPE="ru_RU.UTF-8"
LC_NUMERIC="ru_RU.UTF-8"
LC_TIME="ru_RU.UTF-8"
LC_COLLATE="ru_RU.UTF-8"
LC_MONETARY="ru_RU.UTF-8"
LC_MESSAGES="ru_RU.UTF-8"
LC_PAPER="ru_RU.UTF-8"
LC_NAME="ru_RU.UTF-8"
LC_ADDRESS="ru_RU.UTF-8"
LC_TELEPHONE="ru_RU.UTF-8"
LC_MEASUREMENT="ru_RU.UTF-8"
LC_IDENTIFICATION="ru_RU.UTF-8"
LC_ALL=
So, there's one critical issue here ... and that's that we send bash (in any incarnation) a single backspace character (0x08/0x7f depending on encoding). That's by design: all terminals work like this. That's the only way it can be done: since the terminal cannot know how the receiving end will react to a backspace, it can't necessarily guess, for example, that it's going to delete one cell worth of data and send enough backspaces to destroy that cell.
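To illustrate why the outcome depends on the receiving end, here is a rough sketch of a line editor that receives a single erase byte and decides for itself how much to delete. The function edit_line and its utf8_aware switch are hypothetical names made up for this example; they are not the actual WSL, conhost, or kernel code.

```python
ERASE = 0x7F  # DEL, one of the two bytes mentioned above for Backspace


def is_continuation(byte: int) -> bool:
    """True for UTF-8 continuation bytes, which have the form 10xxxxxx."""
    return (byte & 0xC0) == 0x80


def edit_line(stream: bytes, utf8_aware: bool) -> bytes:
    """Apply erase bytes to the input, the way a simple line editor might."""
    line = bytearray()
    for byte in stream:
        if byte != ERASE:
            line.append(byte)
            continue
        if not line:
            continue  # nothing to erase
        last = line.pop()
        # A UTF-8-aware receiver keeps popping until it has removed the lead byte.
        while utf8_aware and line and is_continuation(last):
            last = line.pop()
    return bytes(line)


typed = "\u0430".encode("utf-8") + bytes([ERASE]) + "\u0430".encode("utf-8")
print(edit_line(typed, utf8_aware=False).hex(" "))  # byte-wise erase
print(edit_line(typed, utf8_aware=True).hex(" "))   # character-wise erase
```

The terminal sends the same three keystrokes in both runs; only the receiver's policy differs.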
However, you're right. There's something odd happening here. Both the Windows Console and Terminal do this differently than urxvt and Konsole on the same system/configuration.
The same can be seen when using input() in ipython:
#!/usr/bin/env python3
x = input("> ")
print(x)
print(len(x))
When I delete one Unicode character with backspace and type a replacement, the result is:
python3 getinput.py
> huhöäüöäxxx
huhöäüöä�xxx
12
This does not happen when using a different terminal, e.g. terminator, instead of the windows terminal.
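The replacement character and the length of 12 are consistent with a single orphaned lead byte being left in the input. A small sketch, assuming the half-deleted character was one of ö, ä or ü (all of which start with the byte c3 in UTF-8) and that the decoder substitutes U+FFFD for invalid bytes:

```python
# Assumed input bytes: the intact text, one leftover lead byte, then "xxx".
damaged = "huhöäüöä".encode("utf-8") + b"\xc3" + b"xxx"

# errors="replace" turns the invalid byte into U+FFFD instead of raising.
text = damaged.decode("utf-8", errors="replace")

print(text)
print(len(text))
```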
I've looked into this today and found that WSL indeed only uses GetConsoleInputW and that we return only the expected characters (U+0430 and U+007f) and nothing else. So, in theory, this should not be a bug in conhost. I'm not sure how Linux and WSL handle line input in a VM like that, but it seems fairly likely that this is a bug in WSL.