Some multibyte chars, e.g. 'α', cannot be read by ReadFile() under codepage 932 (Japanese).
Environment
Windows build number: Microsoft Windows [Version 10.0.18363.1082]
Windows Terminal version (if applicable): 1.2.2381.0
Steps to reproduce
- Compile the code below.
/* test-prog.c */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    unsigned char buf[256];
    DWORD len, i;
    /* Read one line of console input and dump it as hex bytes. */
    if (!ReadFile(GetStdHandle(STD_INPUT_HANDLE), buf, sizeof(buf), &len, NULL))
        return 1;
    for (i = 0; i < len; i++) printf("%02x ", buf[i]);
    printf("\n");
    return 0;
}
- Open Windows Terminal with cmd.exe or powershell.exe.
- Run chcp 932.
- Run the test-prog above.
- Enter 'α' + [enter key].
Expected behavior
If the test-prog above is executed in the command prompt, the multibyte char 'α' can be read correctly as follows.
C:\Test>test-prog
α
83 bf 0d 0a
Actual behavior
If the test-prog above is executed in Windows Terminal, the multibyte char 'α' cannot be read correctly as follows.
C:\Test>test-prog
α
00 0d 0a
This also happens with getchar(), getwchar(), gets(), ReadConsoleA(), etc. As far as I have tested, only ReadConsoleW() can read 'α' correctly.
Most Japanese multibyte chars can be read correctly with the code above, even in Windows Terminal, as follows.
C:\Test>test-prog
あ
82 a0 0d 0a
Supplement
A similar problem occurs under CP936 (Simplified Chinese) and CP949 (Korean).
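The expected byte sequences in the report can be sanity-checked with Python's built-in codecs (this runs anywhere; no Windows console involved):

```python
# Quick sanity check of the byte sequences reported above, using Python's
# built-in codec database rather than the Windows console.
print('α'.encode('cp932').hex(' '))   # Shift-JIS (CP932) bytes for α -> 83 bf
print('あ'.encode('cp932').hex(' '))  # Shift-JIS (CP932) bytes for あ -> 82 a0
# α is also a two-byte character under CP936 (GBK) and CP949 (Korean),
# which is why the same truncation-to-NUL shows up in those codepages too.
print(len('α'.encode('gbk')), len('α'.encode('cp949')))
```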
I'm pretty sure that's just how codepages work, though @miniksa can correct me if I'm wrong. I think if you want to use ReadFile (NOT ReadFileW), you'll probably want to stick to UTF-8 (codepage 65001).
Did this ever work in previous versions of Windows?
So it looks like this actually does work outside of WT (!)
Well that's unfortunate. I wonder if it's a bad MB2WC, or maybe bad input events getting written via win32 input mode. Hmmmmm.
I think this is a ConPTY issue, because it is also reproducible in WSL outside Windows Terminal.
Did this ever work in previous versions of Windows?
What do you mean by previous versions? Such as Windows 7 or Windows 8.1? Or Windows 10 1903 or earlier? Since Windows Terminal is supported only on Windows 10 1903 or later, this cannot be reproduced on Windows 7 or 8.1.
This issue can be reproduced with ConPTY in Windows 10 1809.
I think if you want to use ReadFile (NOT ReadFileW), you'll probably want to stick to UTF-8 (codepage 65001).
In which version(s) of Windows and Windows Terminal, or with which forms of input, should one expect UTF-8 to work correctly as the input codepage? It's not working with typed and pasted input with conhost.exe in Windows 10 2004, or with openconsole.exe in Windows Terminal Preview 1.4.2652.0. Is there a separate code path for East-Asian locales that use IME input, and does it work correctly in that case?
For example, in Python's REPL shell, under Windows Terminal, the following uses ctypes (libffi) to directly call ReadConsoleA and ReadFile:
>>> import ctypes
>>> kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)
>>> kernel32.SetConsoleCP(65001)
1
>>> kernel32.GetConsoleCP()
65001
>>> h = kernel32.GetStdHandle(-10)
>>> buf = (ctypes.c_char * 10)()
>>> pn = ctypes.pointer(ctypes.c_ulong())
>>> kernel32.ReadConsoleA(h, buf, 10, pn, None)
ab_α_ba
1
>>> buf[:9]
b'ab_\x00_ba\r\n'
>>> kernel32.ReadFile(h, buf, 10, pn, None)
cd_α_dc
1
>>> buf[:9]
b'cd_\x00_dc\r\n'
Note that in both cases "α" gets replaced with a null byte.
Setting the output codepage to UTF-8 started working correctly with WriteFile and WriteConsoleA in Windows 8, due to the required rewrite when the old LPC-based 'files' were replaced by real files provided by the ConDrv device. With the old implementation in Windows 7 and earlier, writing UTF-8 returned the number of decoded wide characters written instead of the number of bytes written. The latter caused buffered writers to repeat a write multiple times, leading to the display of garbage text after every write with non-ASCII characters.
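The failure mode described above can be illustrated with a toy simulation (pure Python, not real Windows code): if a write API reports the number of decoded characters instead of the number of bytes, a buffered writer believes the write came up short and re-sends the tail. The function names here are hypothetical, invented only for this sketch.

```python
# Toy simulation (NOT real Windows code) of the Windows 7-era UTF-8 output
# bug: the write path reported the number of decoded wide characters
# "written" instead of the number of bytes, so buffered writers re-sent
# the tail of every multibyte write, displaying garbage.
def buggy_write(data: bytes) -> int:
    """Pretend everything was written, but report the CHARACTER count."""
    return len(data.decode('utf-8', errors='replace'))

def buffered_writer(data: bytes) -> bytes:
    """A typical buffered writer: retry whatever the API says wasn't written."""
    shown = b''
    while data:
        shown += data               # the console actually displays all of it
        n = buggy_write(data)       # ...but the API claims only n "bytes" went out
        data = data[n:]             # so the writer re-sends the "missing" tail
    return shown

# 'αβ' is 4 UTF-8 bytes but 2 characters, so the tail gets written again:
shown = buffered_writer('αβ'.encode('utf-8'))
# shown is longer than the original 4 bytes -- the garbage the text describes.
```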
Setting the input codepage to UTF-8 has never worked correctly with ReadFile and ReadConsoleA in a Western locale in any version of Windows or the console host that I've used. It's limited to reading 7-bit ASCII (i.e. ordinals 0-127). Each non-ASCII character gets replaced by a null character. Since chcp.com sets both the input and output codepages, the advice to use chcp 65001, which gets repeated online ad nauseam, is generally misguided. It's a bad choice for people who need to read non-English input (e.g. a Spanish locale), either typed or pasted into the console. It's fine in Windows 8+ if just the output codepage is set via SetConsoleOutputCP(CP_UTF8).
Part 1:
This is strictly between 437 and 932 per the original filer.
I have this spreadsheet from the investigation of how it works in classic conhost since that's what services these API calls:

It looks like there is a regression in having the input codepage as 932 and the output codepage as 437... that's a scenario that returns null in the new conhost but not in the legacy conhostv1.dll. So I need to get that fixed.
Part 2:
For Windows Terminal, which uses ConPTY, which tries to use UTF-8/65001 for everything...
@DHowett noticed this https://github.com/microsoft/terminal/blob/55151a4a04ea011b42993e71cff1afef2f121af8/src/terminal/adapter/InteractDispatch.cpp#L60-L68
Where the interactive dispatcher (input for virtual terminal) is using the OUTPUT codepage to determine what to encode with. That's clearly not right either.
Summary so far:
- I have a test/investigation table for conhost, but no line of code to point to as broken
- I have a line of code broken in conpty (for terminal), but don't have a specific test/investigation for it yet.
I will continue.
Part 1:
The function where this converts the α into 0x00 instead of the two byte sequence 0x83 0xbf is at https://github.com/microsoft/terminal/blob/9b92986b49bed8cc41fde4d6ef080921c41e6d9e/src/host/misc.cpp#L271.
The α is given to WideCharToMultiByte as 1 byte and asked to convert to codepage 932 with only 1 byte of available output space. Since it obviously needs 2 in that code page, it's converted to 0x00.
- It should probably be getting converted to the appropriate replacement character for the codepage (which I believe we can lookup) instead of to 0x00.
- It should be given 2 bytes of space, not 1, because that's how much it needs.
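The failure can be modeled outside of Win32 (this is a Python analogy for the ConvertToOem behavior, not the actual C++ code; `convert_to_codepage` is a hypothetical helper):

```python
# Python analogy for the bug in ConvertToOem: the conversion of 'α' to
# CP932 is handed only 1 byte of output space, but Shift-JIS needs 2,
# so the conversion fails and the caller leaves a 0x00 behind.
def convert_to_codepage(ch: str, codepage: str, out_space: int) -> bytes:
    encoded = ch.encode(codepage)
    if len(encoded) > out_space:
        return b'\x00'       # models the observed bug: failure becomes a NUL
    return encoded

assert convert_to_codepage('α', 'cp932', 1) == b'\x00'       # the reported bug
assert convert_to_codepage('α', 'cp932', 2) == b'\x83\xbf'   # with enough space
```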
I'm going to explore item 2 further.
Part 1, Item 2
The stack where this occurs is
> OpenConsole.exe!ConvertToOem(const unsigned int uiCodePage, const wchar_t * const pwchSource, const unsigned int cchSource, char * const pchTarget, const unsigned int cchTarget) Line 272 C++
OpenConsole.exe!TranslateUnicodeToOem(const wchar_t * pwchUnicode, const unsigned long cchUnicode, char * pchAnsi, const unsigned long cbAnsi, std::unique_ptr<IInputEvent,std::default_delete<IInputEvent>> & partialEvent) Line 208 C++
OpenConsole.exe!COOKED_READ_DATA::_handlePostCharInputLoop(const bool isUnicode, unsigned __int64 & numBytes, unsigned long & controlKeyState) Line 1176 C++
OpenConsole.exe!COOKED_READ_DATA::Read(const bool isUnicode, unsigned __int64 & numBytes, unsigned long & controlKeyState) Line 451 C++
OpenConsole.exe!COOKED_READ_DATA::Notify(const WaitTerminationReason TerminationReason, const bool fIsUnicode, long * const pReplyStatus, unsigned __int64 * const pNumBytes, unsigned long * const pControlKeyState, void * const __formal) Line 411 C++
OpenConsole.exe!ConsoleWaitBlock::Notify(const WaitTerminationReason TerminationReason) Line 152 C++
OpenConsole.exe!ConsoleWaitQueue::_NotifyBlock(ConsoleWaitBlock * pWaitBlock, const WaitTerminationReason TerminationReason) Line 117 C++
OpenConsole.exe!ConsoleWaitQueue::NotifyWaiters(const bool fNotifyAll, const WaitTerminationReason TerminationReason) Line 90 C++
OpenConsole.exe!ConsoleWaitQueue::NotifyWaiters(const bool fNotifyAll) Line 65 C++
OpenConsole.exe!InputBuffer::WakeUpReadersWaitingForData() Line 165 C++
OpenConsole.exe!InputBuffer::Write(std::deque<std::unique_ptr<IInputEvent,std::default_delete<IInputEvent>>,std::allocator<std::unique_ptr<IInputEvent,std::default_delete<IInputEvent>>>> & inEvents) Line 581 C++
OpenConsole.exe!InputBuffer::Write(std::unique_ptr<IInputEvent,std::default_delete<IInputEvent>> inEvent) Line 539 C++
OpenConsole.exe!HandleGenericKeyEvent(KeyEvent keyEvent, const bool generateBreak) Line 165 C++
OpenConsole.exe!HandleKeyEvent(HWND__ * const hWnd, const unsigned int Message, const unsigned __int64 wParam, const __int64 lParam, int * pfUnlockConsole) Line 467 C++
OpenConsole.exe!Microsoft::Console::Interactivity::Win32::Window::ConsoleWindowProc(HWND__ * hWnd, unsigned int Message, unsigned __int64 wParam, __int64 lParam) Line 523 C++
OpenConsole.exe!Microsoft::Console::Interactivity::Win32::Window::s_ConsoleWindowProc(HWND__ * hWnd, unsigned int Message, unsigned __int64 wParam, __int64 lParam) Line 58 C++
user32.dll!UserCallWinProcCheckWow(_ACTIVATION_CONTEXT * pActCtx, __int64(*)(tagWND *, unsigned int, unsigned __int64, __int64) pfn, HWND__ * hwnd, _WM_VALUE msg, unsigned __int64 wParam, __int64 lParam, void * fEnableLiteHooks, int) Line 280 C++
user32.dll!DispatchMessageWorker(tagMSG * pmsg, int fAnsi) Line 3157 C++
OpenConsole.exe!ConsoleInputThreadProcWin32(void * __formal) Line 1082 C++
kernel32.dll!BaseThreadInitThunk(unsigned long RunProcessInit, long(*)(void *) StartAddress, void * Argument) Line 70 C
ntdll.dll!RtlUserThreadStart(long(*)(void *) StartAddress, void * Argument) Line 1152 C
One frame up, the code reference is: https://github.com/microsoft/terminal/blob/09471c3753d888abcfd160dae524988003da862a/src/host/dbcs.cpp#L157-L208
The length of space given to ConvertToOem is 1 when IsGlyphFullWidth returns false on the given character α. Otherwise, two bytes are given as the length of the array BYTE AsciiDbcs[2];
I momentarily drilled into IsGlyphFullWidth to check what it's doing before realizing that the full-widthness or half-widthness of a glyph really has nothing to do with how many bytes it's going to consume when returned to the user. But it does lead me to one potential revelation: when full width is checked and we don't have an answer for it in the table, we ask the font. The font chosen in conhost varies between output codepage 437 and output codepage 932.
The questions then are:
A. Is there a divergence here between output CP 437 and output CP 932 in respect to what IsGlyphFullWidth reports?
B. Why is this using the glyph width to determine whether it fits? (Shouldn't it just use whatever is necessary?)
C. How is this different from conhostv1.dll because it worked there? (Was it checking width? Did it treat fonts differently?)
Part 1, Item 2, Questions
A. Is there a divergence here between output CP 437 and output CP 932 in respect to what IsGlyphFullWidth reports?
No. That's not the issue. The font is the issue. There are two circumstances I've now observed in conhostv1.dll, but keep in mind that it tends to choose slightly different fonts than the current conhost does, because we unlocked the ability to switch codepages early in the conhostv2 journey. (Locking to a specific codepage, the startup OEMCP for DBCS codepages, was a relic from when we sold a "Japanese Edition" or a "Chinese Edition" of Windows.)
If the font says that the α is wide, it gets translated into two bytes and fits and we see the 0x83 0xbf answer.
If the font says that the α is narrow, it literally returns one byte of uninitialized memory! The code doesn't check the return code from the conversion and just walks on as if 1 byte was converted!
B. Why is this using the glyph width to determine whether it fits? (Shouldn't it just use whatever is necessary?)
As far as I can tell, this is a long-standing mix-up that someone made long ago when implementing this. The terminology of "full width" and "double byte" and the number of CHARs used to store something are all treated as the same thing, even though they can vary dramatically. For instance, a narrow character can be 2 bytes in a DBCS codepage.
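That independence of display width and byte count can be demonstrated with Python's `unicodedata` module (a side illustration, not conhost code):

```python
import unicodedata

# Display width and encoded byte count are independent properties.
# 'α' (U+03B1) is East Asian Ambiguous: some fonts draw it narrow, some
# wide -- yet it is always 2 bytes in CP932 (Shift-JIS). Conversely,
# halfwidth katakana is narrow AND only 1 byte in CP932.
print(unicodedata.east_asian_width('α'))   # 'A' (ambiguous width)
print(len('α'.encode('cp932')))            # 2 bytes regardless of width
print(unicodedata.east_asian_width('ｱ'))   # 'H' (halfwidth)
print(len('ｱ'.encode('cp932')))            # 1 byte
```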
C. How is this different from conhostv1.dll because it worked there? (Was it checking width? Did it treat fonts differently?)
It's different because conhostv1.dll used RtlUnicodeToOemN when the system startup OEMCP matched the currently chosen input codepage. For conhostv2.dll, the current conhost.exe, and openconsole.exe, we run everything through WideCharToMultiByte instead, deduplicating the code path (in theory), AND we tend to end up with a different font that says this character is narrow because of the broader unlocked choice. Finally, the v2 one actually checks error return codes and tries to compensate for them, where v1 just returns uninitialized memory, or just so happens to have enough memory for the conversion by fluke due to the conflation of the "wide" and "2 byte" terminology.
Conclusion
Due to the total conflation of DBCS, wide/narrow, and 1/2 bytes in conhostv1, combined with the fact that certain fonts made it return uninitialized memory, the v1 variant is not a model of sanity for anyone to follow.
I believe the fix here is to make the v2 one stop checking the "width" of the character as a measure of how many bytes it will take and instead just do the conversion, filling as much of the buffer as it has access to rather than returning a \0 on failure (and stowing the leftover bytes for the next call, per the read model of many other ReadConsole* family functions).
Okay, so I have a proposed fix for the translation part in ConvertUnicodeToOem that:
- Doesn't try to count the width of characters in bytes or the widthness of a byte. It just converts up to the available space given.
- Uses the default character if the conversion fails instead of leaving behind a null byte (or the even worse v1 policy of an uninitialized byte).
It's here: https://github.com/microsoft/terminal/commit/c1ca8f346d74f3b9ca7a1aa379ec740e1f62d31c
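The intended behavior could be sketched like this (a Python sketch of the two bullet points above, not the actual C++ commit; `convert_up_to` is a hypothetical helper):

```python
# Hypothetical sketch of the proposed translation behavior: convert up to
# the available space, substitute the codepage's default character on
# failure instead of a NUL, and stow leftover input for the next call.
def convert_up_to(text: str, codepage: str, space: int):
    out = b''
    for i, ch in enumerate(text):
        b = ch.encode(codepage, errors='replace')  # default char, never NUL
        if len(out) + len(b) > space:
            return out, text[i:]    # leftovers stowed for the next read
        out += b
    return out, ''

# 'a' is 1 byte in CP932 and 'α' is 2; with 3 bytes of space both fit,
# with 2 bytes of space the α is held back for the next call:
assert convert_up_to('aα', 'cp932', 3) == (b'a\x83\xbf', '')
assert convert_up_to('aα', 'cp932', 2) == (b'a', 'α')
```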
But now there's another problem: the parent calling it isn't giving it enough buffer space. And that isn't because the user buffer provided in the API call was necessarily too small; it's because the parent function _handlePostCharInputLoop is pre-guessing how many bytes it will take by checking the WIDTH of every character and reserving two codepage bytes for a wide character and one for a narrow one. https://github.com/microsoft/terminal/blob/9b92986b49bed8cc41fde4d6ef080921c41e6d9e/src/host/readDataCooked.cpp#L1003-L1199
That sort of behavior is standard practice for COOKED READ... that is, guessing byte count based on character width, which can depend on the font chosen, which can depend on the codepage selected. Gross.
So my plan is to try to make a targeted fix in _handlePostCharInputLoop for now, but we're going to need to book a full COOKED READ rework soonish, because very similar issues are impacting every other issue on the tracker involving "cooked read", "utf8 and Read*A APIs", and "emoji in cmd.exe", like this one: #1503.
Also... ConvertUnicodeToOem might just be completely unnecessary, as it sounds an awful lot like some of our other ConvertToA-like functions and/or til::u16a and friends from #4493 that I might have an opportunity to resurrect (@german-one, fyi).
I wrote a few tests here to cover this as I work on fixing it:
https://github.com/microsoft/terminal/blob/8561bd217dea3e8c77f281ba70d1b181069e95a7/src/host/ft_host/API_AliasTests.cpp
(@german-one, fyi)
Feel free to do whatever is necessary to make any good out of the code. Get back to me if you'd like me to work on it any further. Note: There are still codepages where the handling of partials is not supported. They are listed in the method description of operator(). That's either because I didn't find the spec or because it's quite tricky (like for UTF-7 which is a mix of ASCII and base64).
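The partial-sequence handling mentioned above can be illustrated with Python's incremental decoders (a side illustration of the general technique, not the actual code): a CP932 lead byte that arrives alone must be buffered until its trail byte arrives in a later chunk.

```python
import codecs

# A CP932 (Shift-JIS) character can be split across two reads. An
# incremental decoder holds the lone lead byte and emits nothing until
# the trail byte completes the character.
dec = codecs.getincrementaldecoder('cp932')()
first = dec.decode(b'\x83')    # lead byte of α: held back, nothing emitted
second = dec.decode(b'\xbf')   # trail byte arrives: α is completed
print(repr(first), repr(second))
```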
That's fine. Thank you!