Incorrect encoding although code page is correct
Windows Terminal version
1.18.10301.0
Windows build number
10.0.19045.0
Other Software
PowerShell 7.4.1
Diagon (diagon.exe from diagon-1.1.156-win64.zip)
Steps to reproduce
This is an issue that occurs on Terminal but not Conhost.
Executing the command diagon Math -- "1+1/2 + sum(i,0,10) = 112/2" would output:
10
___
1 ╲ 112
1 + ─ + ╱ i = ───
2 ‾‾‾ 2
0
The output is correct on UTF-8 (65001) and would be broken on "Multilingual (Latin I)" (850).
With the wrong encoding, the output looks like this:
10
___
1 Ôò▓ 112
1 + ÔöÇ + Ôò▒ i = ÔöÇÔöÇÔöÇ
2 ÔÇ¥ÔÇ¥ÔÇ¥ 2
0
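The garbled characters above are what you get when the UTF-8 bytes of the box-drawing characters are read one byte at a time as code page 850. A minimal Python sketch of that misreading follows; Python is used only for illustration and is not part of diagon, and the helper name `misread_as_cp850` is hypothetical.

```python
def misread_as_cp850(text: str) -> str:
    """Encode text as UTF-8, then wrongly decode those bytes as code page 850."""
    return text.encode("utf-8").decode("cp850")


if __name__ == "__main__":
    # Box-drawing characters taken from the correct diagon output.
    for ch in ["╲", "─", "╱", "‾"]:
        utf8_hex = ch.encode("utf-8").hex()
        print(f"U+{ord(ch):04X} {ch} -> UTF-8 bytes {utf8_hex} -> {misread_as_cp850(ch)}")
```

Each character is three bytes in UTF-8, so each one turns into three code page 850 characters, for example `╲` becomes `Ôò▓` and `─` becomes `ÔöÇ`, matching the broken output in this report.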
Expected Behavior
Output should be correct when encoding is UTF-8 (setting code page to 65001)
Actual Behavior
The output is incorrect on Terminal even when the encoding is UTF-8.
I have verified this with chcp, and the code page is 65001.
Screenshots for reference
Conhost (correct behaviour)
Windows Terminal (incorrect behaviour)
I am unsure whether this would help identify the issue, but here are a few things I noticed:
- The issue occurs for both powershell and cmd.
- On powershell, the issue also occurs when the output stream is piped to "Set-Clipboard".
  - i.e. executing `diagon Math -- "1+1/2 + sum(i,0,10) = 112/2" | scb` would fill our clipboard with incorrectly encoded data.
- On powershell, the issue does NOT occur when the output stream is redirected to a file!
  - i.e. executing `diagon Math -- "1+1/2 + sum(i,0,10) = 112/2" > temp.txt` would fill `temp.txt` with correctly encoded data.
I must be doing something wrong
I noticed that you're using Windows 10. Thinking about this some more, this may be an issue with Windows 10, since its CRT (C/C++ stdlib) version is older, and the CRT never had excellent Unicode support. It's much better now, but it's still somewhat broken for surrogate pairs and such.
I believe this is the case, because this issue doesn't reproduce for me on Windows 11, and it doesn't make any sense why chcp would have no effect for you. chcp definitely does work correctly, because I know that it changes the result of GetConsoleOutputCP and we haven't touched either of the two in a decade.
In other words, I think this may be a CRT issue. You could potentially fix it by compiling diagon yourself with the latest version of the Windows SDK. You may also need to statically link the CRT.
I'll leave this issue open because I can't really prove that it's due to the CRT.
I noticed in your screenshot that the output is correct on code page 437, but it is not the case on my machine...
I know very little about the CRT and I don't quite understand how it could explain the different behavior on Conhost and Terminal.
Thanks for the help again!
The CRT is Microsoft's implementation of C's standard library. It implements C functions like malloc and free, but also printf, which is what diagon uses to print text to stdout.
You seem to be aware what code pages are so I won't explain that part.
You might be familiar with this Region control panel:
This control panel selects the value of the special CP_ACP code page (the system default code page as it says). For instance, on my PC the CP_ACP stands for code page 437 and on your PC it's 850. All narrow Windows APIs (the ones with the A at the end) use the CP_ACP. So if you call CreateFileA with a path, that path needs to be encoded in the 437 code page on my system and in 850 on yours.
The problem now is that when I say "all narrow Windows APIs" what I really meant to say is: All narrow Windows APIs except for the console APIs, because the original console API designers unfortunately often had the foresight of a brick wall.
A console application on Windows can simultaneously (!) read input in US OEM 437, write output in Latin1 850, and also read and write files in UTF8 65001. All at the same time! I can see why these 2 additional code pages were added, since it adds a ton of flexibility, but the problem you're seeing is a direct consequence of this flexibility.
The CRT now needs to sort of guess what code page you actually want. Given your screenshots, your version seems to always be using the CP_ACP. As explained above, this means that calling chcp won't ever have any effect on it.
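To see how these code pages can differ on a given machine, they can be queried side by side. The following is a minimal sketch using Python's `ctypes` to call the Win32 functions `GetACP`, `GetOEMCP`, `GetConsoleCP` and `GetConsoleOutputCP`; the helper name `windows_code_pages` is hypothetical, `GetOEMCP` is included as an extra data point that the thread does not mention, and the function simply returns `None` when not run on Windows.

```python
import ctypes
import sys


def windows_code_pages():
    """Return the process and console code pages as a dict, or None off Windows."""
    if sys.platform != "win32":
        return None
    kernel32 = ctypes.windll.kernel32
    return {
        "GetACP": kernel32.GetACP(),                          # code page used by the narrow (A) APIs
        "GetOEMCP": kernel32.GetOEMCP(),                      # system OEM code page
        "GetConsoleCP": kernel32.GetConsoleCP(),              # console input code page
        "GetConsoleOutputCP": kernel32.GetConsoleOutputCP(),  # console output code page, changed by chcp
    }


if __name__ == "__main__":
    pages = windows_code_pages()
    if pages is None:
        print("Not running on Windows; nothing to query.")
    else:
        for name, value in pages.items():
            print(f"{name}: {value}")
```

Running this before and after `chcp 65001` should show the console values changing while `GetACP` stays the same, which is the mismatch described above.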
I can recommend 2 solutions for this that diagon can implement. To be clear, I have not tested either of these yet:
- Try a newer Windows SDK version. Static linking with the CRT in the SDK may be needed.
- Follow this guide: https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page
  This will change the `CP_ACP` to `UTF-8`. BUT it will not change the console code page(s). That requires an explicit call to `SetConsoleOutputCP` at the start of the program.
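As a rough illustration of that last step, here is a sketch of switching the console output code page to UTF-8 at program start. diagon itself is C++, so this Python `ctypes` version only demonstrates the Win32 calls involved (`GetConsoleOutputCP` and `SetConsoleOutputCP`); the helper name `use_utf8_console_output` is hypothetical, and like the suggestions above it has not been tested against diagon.

```python
import ctypes
import sys

CP_UTF8 = 65001  # Windows code page identifier for UTF-8


def use_utf8_console_output():
    """Switch console output to UTF-8.

    Returns the previous output code page so the caller can restore it,
    or None when not on Windows or when the switch fails.
    """
    if sys.platform != "win32":
        return None
    kernel32 = ctypes.windll.kernel32
    previous = kernel32.GetConsoleOutputCP()
    if not kernel32.SetConsoleOutputCP(CP_UTF8):
        return None
    return previous


if __name__ == "__main__":
    previous = use_utf8_console_output()
    print("previous console output code page:", previous)
```

A program that changes the code page this way may want to restore the previous value before exiting, since the setting belongs to the console and outlives the process.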