Garbled Chinese characters when debugging programs with gdb or lldb
Description / Steps to reproduce the issue
Environment:
OS: win10 22H2
gcc version 13.1.0 (Rev6, Built by MSYS2 project)
GNU gdb (GDB) 13.2
clang version 16.0.4
lldb version 16.0.4
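MSYS2 MinGW64, updated to the latest packages with pacman -Suy
Problem:
a. In the MinGW shell, when debugging strings that contain Chinese with gdb or lldb, both the source listing and the printed variables are garbled.
b. In Windows Cmd or PowerShell, gdb displays strings that contain Chinese correctly, but lldb shows garbled text.
At first I thought this was a bug in gdb or lldb, but testing on Linux showed that both gdb and lldb display Chinese correctly there.
Steps to reproduce:
1. Create a simple C++ program, main.cpp, and save it with UTF-8 encoding: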
#include <string>

int main(int argc, char* argv[])
{
    const char* str = "测试";
    std::string name = "测试";
    return 0;
}
2. Build the program with g++ -gdwarf-4 main.cpp -o main.exe
3. Debug in the MinGW64 shell
In the MinGW64 shell options, set the text character set to zh_CN.UTF-8
1). Debug with lldb
$ echo $LANG
zh_CN.UTF-8
$ lldb main
(lldb) target create "main"
Current executable set to 'C:\Users\admin\Desktop\demo\main.exe' (x86_64).
(lldb) b main.cpp:7
Breakpoint 1: where = main.exe`main + 83 at main.cpp:7:9, address = 0x00000001400014a3
(lldb) r
(lldb) Process 62076 launched: 'C:\Users\admin\Desktop\demo\main.exe' (x86_64)
Process 62076 stopped
* thread #1, stop reason = breakpoint 1.1
frame #0: 0x00007ff7f81114a3 main.exe`main(argc=1, argv=0x0000026e62964960) at main.cpp:7:9
4 {
5 const char* str = "娴嬭瘯";
6 std::string name = "娴嬭瘯";
-> 7 return 0;
8 }
(lldb) p str
(const char *) $0 = 0x00007ff7f8114000 "娴嬭瘯"
(lldb) p name
(std::string) $1 = "娴嬭瘯"
(lldb)
2). Debug with gdb
$ gdb main
GNU gdb (GDB) 13.2
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-w64-mingw32".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from main...
(gdb) b main.cpp:7
Breakpoint 1 at 0x1400014a3: file main.cpp, line 7.
(gdb) r
Starting program: C:\Users\admin\Desktop\demo\main.exe
[New Thread 29368.0x19488]
[New Thread 29368.0x19eb4]
[New Thread 29368.0x7fdc]
Thread 1 hit Breakpoint 1, main (argc=1, argv=0xf34d40) at main.cpp:7
7 return 0;
(gdb) l
2
3 int main(int argc, char* argv[])
4 {
5 const char* str = "娴嬭瘯";
6 std::string name = "娴嬭瘯";
7 return 0;
8 }
(gdb) p str
$1 = 0x7ff674014000 "娴嬭瘯"
(gdb) p name
$2 = "娴嬭瘯"
(gdb)
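As shown, in the MinGW shell both the source listing and the printed variables are garbled in lldb and gdb alike; "娴嬭瘯" should be displayed as "测试".
4. Debug in Windows Cmd or PowerShell
First run chcp 65001 to switch the terminal to the UTF-8 code page
1). Debug with lldb: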
C:\Users\admin\Desktop\demo>lldb main
(lldb) target create "main"
Current executable set to 'C:\Users\admin\Desktop\demo\main.exe' (x86_64).
(lldb) b main.cpp:7
Breakpoint 1: where = main.exe`main + 83 at main.cpp:7:9, address = 0x00000001400014a3
(lldb) r
(lldb) Process 87536 launched: 'C:\Users\admin\Desktop\demo\main.exe' (x86_64)
Process 87536 stopped
* thread #1, stop reason = breakpoint 1.1
frame #0: 0x00007ff6740114a3 main.exe`main(argc=1, argv=0x000001ab47a11820) at main.cpp:7:9
4 {
5 const char* str = "娴嬭瘯";
6 std::string name = "娴嬭瘯";
-> 7 return 0;
8 }
(lldb) p str
(const char *) $0 = 0x00007ff674014000 "娴嬭瘯"
(lldb) p name
(std::string) $1 = "娴嬭瘯"
(lldb)
2). Debug with gdb:
C:\Users\admin\Desktop\demo>gdb main
GNU gdb (GDB) 13.2
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-w64-mingw32".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from main...
(gdb) b main.cpp:7
Breakpoint 1 at 0x1400014a3: file main.cpp, line 7.
(gdb) r
Starting program: C:\Users\admin\Desktop\demo\main.exe
[New Thread 148736.0x2ed90]
[New Thread 148736.0x15bfc]
[New Thread 148736.0x20eb4]
Thread 1 hit Breakpoint 1, main (argc=1, argv=0xc1c00) at main.cpp:7
7 return 0;
(gdb) l
2
3 int main(int argc, char* argv[])
4 {
5 const char* str = "测试";
6 std::string name = "测试";
7 return 0;
8 }
(gdb) p str
$1 = 0x7ff674014000 "测试"
(gdb) p name
$2 = "测试"
(gdb)
As shown, with the Windows terminal set to the UTF-8 code page, gdb displays Chinese correctly but lldb shows garbled text.
Expected behavior
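Whether in the MinGW shell, Windows Cmd, or PowerShell, when the character set is UTF-8, both the source listing and the printed variables should display correctly when gdb or lldb debugs variables that contain Chinese characters.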
Actual behavior
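- In the MinGW shell with the character set set to UTF-8, debugging strings that contain Chinese with gdb or lldb produces garbled text in both the source listing and the printed variables.
- In Windows Cmd or PowerShell with the character set set to UTF-8, gdb displays strings that contain Chinese correctly, but lldb shows garbled text.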
Verification
- [X] I have verified that my MSYS2 is up-to-date before submitting the report (see https://www.msys2.org/docs/updating/)
Windows Version
MINGW64_NT-10.0-19045
MINGW environments affected
- [X] MINGW64
- [x] MINGW32
- [x] UCRT64
- [ ] CLANG64
- [ ] CLANG32
- [ ] CLANGARM64
Are you willing to submit a PR?
No response
const char* str = "娴嬭瘯";
It looks like GDB interprets the source file as GB18030. However, I have the system-wide UTF-8 option turned on, so in principle code page 936 should not be involved at all. IIRC there are some libiconv-related options in the GDB build configuration; I'll look into that.
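To make the suspected misdecoding concrete, here is a minimal C++ sketch (not taken from the thread) that dumps the bytes of the test string. The helper name hex_bytes and the constant kTestUtf8 are hypothetical, and the literal is spelled with explicit byte escapes so the result does not depend on how the compiler reads the source file. The UTF-8 encoding of "测试" is six bytes; a decoder that assumes a double-byte Chinese code page would consume them two at a time (E6 B5, 8B E8, AF 95), which is consistent with the three-character garbage 娴嬭瘯 seen in the logs.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: render each byte of s as two lowercase hex digits,
// separated by single spaces.
std::string hex_bytes(const std::string& s) {
    std::string out;
    char buf[4];
    for (unsigned char c : s) {
        if (!out.empty()) out += ' ';
        std::snprintf(buf, sizeof buf, "%02x", static_cast<unsigned>(c));
        out += buf;
    }
    return out;
}

// UTF-8 encoding of "测试" (U+6D4B U+8BD5), written as explicit bytes.
const std::string kTestUtf8 = "\xe6\xb5\x8b\xe8\xaf\x95";
```

Calling hex_bytes(kTestUtf8) yields e6 b5 8b e8 af 95, the same bytes that sit in the debugged program's memory regardless of how a terminal chooses to render them.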
It looks like GDB uses ncurses to display source code, which uses mbrtowc() to convert narrow strings to wide strings ^1, which for MSVCRT goes into the private implementation of mingw-w64, which unfortunately does not support UTF-8.
So this is not fixable for MSVCRT. Either don't have non-ASCII characters in source code, or use the UCRT GDB.
Is ncurses patchable?
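I found that in the MinGW shell, with either lldb or gdb, running p (int) SetConsoleOutputCP(65001) once during debugging makes UTF-8 strings display mostly correctly for the rest of that MinGW shell session. Some Chinese characters still fail to display, though; for example, the character 界 in the string 你好,世界 is not shown correctly.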
(gdb) p (int) SetConsoleOutputCP(65001)
$1 = 1
(gdb) p name
$2 = "你好,世\214!\n"
(gdb) p str
$3 = "你好,世", <incomplete sequence \214>
(gdb)
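But if I open a new MinGW shell window, the text is garbled again.
So it looks like the problem is that the MinGW shell does not set the character set correctly when gdb or lldb is debugging. However, I don't know why some Chinese characters are still displayed incompletely after the character set has been set with SetConsoleOutputCP(65001).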
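The \214 in the gdb output above is an octal byte escape. As a hedged illustration (the helper octal_escapes and the constant kJieUtf8 are hypothetical and not part of gdb), the sketch below writes the UTF-8 bytes of 界 (U+754C) in that notation. The last byte comes out as \214, which is consistent with only the trailing byte of 界 surviving as an undecodable leftover while the first two bytes were consumed or dropped somewhere on the way to the terminal.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: write every byte of s as a three-digit octal escape,
// the notation that appears in the gdb output for undecodable bytes.
std::string octal_escapes(const std::string& s) {
    std::string out;
    char buf[8];
    for (unsigned char c : s) {
        std::snprintf(buf, sizeof buf, "\\%03o", static_cast<unsigned>(c));
        out += buf;
    }
    return out;
}

// UTF-8 encoding of "界" (U+754C), written as explicit bytes.
const std::string kJieUtf8 = "\xe7\x95\x8c";
```

octal_escapes(kJieUtf8) gives \347\225\214, so the stray \214 matches the third byte of 界 exactly.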
I hope I wouldn't have to repeat myself: If you want sane UTF-8 support, then you must use UCRT. Anything about UTF-8 on MSVCRT is unspecified and unsupported.
@lhmouse @MehdiChinoune After testing, I found that the same problem exists in UCRT64. I previously selected only MinGW under 'MINGW environments affected' because that was the only environment I had tested; the unselected ones were simply untested.
I compiled with the gcc from UCRT64 and debugged with the gdb from UCRT64.
GDB (rather, ncurses) knows nothing about the encoding of source files. GDB just calls setlocale(LC_ALL, "") and ncurses will take whatever the CRT happens to provide.
One more word:
I don't know how native curses programs communicate with mintty. It is possible that mintty allocates a new console for the native GDB, or communicates with GDB through a pipe. In either case, if you are on a Simplified Chinese system and have not turned on the system-wide UTF-8 option, the default code page for the native GDB will be .936 and not .65001.
If the native GDB is instead launched in a Windows console, such as CMD or PowerShell, the default code page for CRT functions follows the code page of that console, respecting your previous CHCP request, if any. But even if the active code page is .65001, MSVCRT still operates in the .936 code page because it doesn't support .65001. Any request to set a UTF-8 code page will fail.
By default, the MSYS shell console (both MINGW and UCRT64; I have not tested the others) still displays text using the system code page, which is 936 for Simplified Chinese, rather than 65001, even though the locale is set to zh_CN.UTF-8 in Options. This may be the problem. It would also explain why gdb becomes mostly normal once SetConsoleOutputCP(65001) has been called in the program. lldb, however, is still garbled (in UCRT64 as well), and I don't know why.
The locale setting of MSYS2 shell only sets LC_* environment variables (particularly LC_CTYPE) and only has an effect on MSYS2 programs. Native programs do not acquire locale settings from environment variables ^1.
setlocale( LC_ALL, "" ); Sets the locale to the default, which is the user-default ANSI code page obtained from the operating system. The locale name is set to the value returned by GetUserDefaultLocaleName. The code page is set to the value returned by GetACP.
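A small C++ sketch of how one might observe this behaviour from a native program; the helper name try_locale is purely illustrative. std::setlocale returns a null pointer when the request is rejected, so the helper makes that case visible. With an empty name, a native Windows program gets whatever default the C runtime derives from the system rather than from LANG or LC_*. The ".UTF8" request is an assumption about what a UTF-8 capable UCRT accepts; per the comments in this thread it should be rejected by MSVCRT.

```cpp
#include <clocale>
#include <string>

// Hypothetical helper: request a locale and report what the C runtime chose.
// Returns "(rejected)" when setlocale refuses the name.
std::string try_locale(const char* name) {
    const char* result = std::setlocale(LC_ALL, name);
    return result ? std::string(result) : std::string("(rejected)");
}
```

Printing try_locale("") shows the default the runtime picked up, and try_locale(".UTF8") shows whether this particular CRT accepts a UTF-8 code page. The portable "C" locale is always accepted.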
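Text display in the Windows console depends on three things: the encoding of the strings inside the program, the locale set with setlocale, and the display code page of the Windows console. Ideally all three use the same code page.
When the user program neither calls setlocale to set the locale nor calls SetConsoleOutputCP to set the Windows console code page, in theory the program should use the settings of the environment, such as the zh_CN.UTF-8 set in the MSYS environment.
Btw, MinGW's printf output is ultimately written to the Windows console one byte at a time. On Win10 this is normally fine, though I don't know whether it can cause problems in some special cases; on Win7 and earlier it definitely causes problems. The VS build has no such problem on Win7, because it calls WriteFile to write to the Windows console.
The MinGW call:
The VS call:
I hope this clue is useful for fixing the garbled-text bug.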
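Text display in the Windows console depends on three things: the encoding of the strings inside the program, the locale set with setlocale, and the display code page of the Windows console. Ideally all three use the same code page. When the user program neither calls setlocale to set the locale nor calls SetConsoleOutputCP to set the Windows console code page, in theory the program should use the settings of the environment, such as the zh_CN.UTF-8 set in the MSYS environment.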
Are you aware of the fact that setlocale() for native Windows programs does not load locale settings from environment variables?
And are you aware of the fact that zh_CN.UTF-8 is an invalid locale name for native Windows programs?
And are you aware of the fact that MSVCRT does not support UTF-8 at all?
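Btw, MinGW's printf output is ultimately written to the Windows console one byte at a time. On Win10 this is normally fine, though I don't know whether it can cause problems in some special cases; on Win7 and earlier it definitely causes problems. The VS build has no such problem on Win7, because it calls WriteFile to write to the Windows console.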
Yes, we've been aware of it, but sadly there hasn't been an acceptable solution.
Are you aware of the fact that setlocale() for native Windows programs does not load locale settings from environment variables? And are you aware of the fact that zh_CN.UTF-8 is an invalid locale name for native Windows programs? And are you aware of the fact that MSVCRT does not support UTF-8 at all?
I think you may have misunderstood what I said.
@Biswa96 @MehdiChinoune Do you have any ideas about how mintty communicates with native console programs? Perhaps the best we can do is that if mintty starts a native program, it can set the console input and output code pages before actually creating the child process, such that if the native program is linked against UCRT (for UCRT64 and CLANG*) it will initialize a UTF-8 locale. Again for MSVCRT this is not fixable.
Do you have any ideas about how mintty communicates with native console programs?
The only thing I know is that it uses 'conpty' by default. So, console handles, pipes etc. For in-depth info, I can ask the mintty developer.
I have the same issue, but recently I have found a workaround. For better illustration, let me show you an example:
In .gdbinit, I set the charset used by gdb to UTF-8:
set charset UTF-8
layout src
set disassembly-flavor intel
# ...
// test.c
#include <stdio.h>
int main(int argc, char *argv[])
{
    const char* s = "你好 こんにちは 안녕하세요";
    puts(s);
    return 0;
}
Use the UCRT64 or CLANG64 toolchain to compile the above code, then use GDB to debug it.
Note: If the file name contains Unicode characters, you have to use the CLANG64 toolchain to compile.
https://github.com/user-attachments/assets/f3b9fdba-3445-4131-8f72-f0e4159182f7
In this video, gdb garbles s.
s is decoded as GB2312 (i.e., Simplified Chinese CP 936, my system locale) even when the source file, the shell's input/output, the code page, etc. are all set to UTF-8.
You may use mt.exe (provided in the Windows SDK, so you need to install MSVC) to inject the following manifest file with activeCodePage into gdb, which forces the process to use UTF-8.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<assembly manifestVersion="1.0" xmlns="urn:schemas-microsoft-com:asm.v1">
<assemblyIdentity type="win32" name="..." version="6.0.0.0"/>
<application>
<windowsSettings>
<activeCodePage xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings">UTF-8</activeCodePage>
</windowsSettings>
</application>
</assembly>
mt.exe -manifest <manifest path> -outputresource:"<gdb.exe path>;#1"
https://github.com/user-attachments/assets/149a22da-c649-4f18-a55e-53a9be83a4ce
You may notice I use a command SelectCppToolchain; it is my custom PowerShell function for switching between different C++ toolchains.
The drawback of this workaround is apparent: you have to manually inject the manifest into gdb.exe again each time you update it. But since most people do not update frequently, I think this approach is better than setting the Windows system locale to UTF-8, which may harm legacy programs.
I tried the latest version of MinGW, and Chinese characters now display correctly in both gdb and lldb. With lldb, the debugged program needs to call SetConsoleOutputCP(65001) before it outputs Chinese characters for them to display correctly.
However, Chinese characters in the source shown by gdb's TUI are still garbled, and they are also garbled when printed with the p command during debugging. Program output, as with lldb, displays Chinese correctly as long as the program has called SetConsoleOutputCP(65001). In addition, right after entering TUI mode the text cursor is misplaced: it appears before the last character instead of after it.
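The workaround mentioned above, calling SetConsoleOutputCP(65001) in the debugged program before it writes any Chinese text, could look roughly like the following. This is a sketch under stated assumptions, not code from the thread: the names kUtf8CodePage, use_utf8_console_output and print_sample are hypothetical, and the #ifdef exists only so the example also compiles outside Windows, where the call is skipped.

```cpp
#include <cstdio>
#ifdef _WIN32
#include <windows.h>
#endif

// Code page identifier for UTF-8 on Windows.
constexpr unsigned kUtf8CodePage = 65001;

// Switch the console output code page to UTF-8 before printing.
// Returns true on success; on non-Windows systems this is a no-op
// that reports success.
bool use_utf8_console_output() {
#ifdef _WIN32
    return SetConsoleOutputCP(kUtf8CodePage) != 0;
#else
    return true;
#endif
}

// Print a UTF-8 encoded sample ("测试", spelled as explicit bytes so the
// result does not depend on the source file encoding).
void print_sample() {
    use_utf8_console_output();
    std::puts("\xe6\xb5\x8b\xe8\xaf\x95");
}
```

Calling print_sample() at the start of main is enough to try it. This only affects the program's own output; it does not change how gdb's TUI or the p command decode strings.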