
COOKED_READ doesn't return UTF-8 on *A APIs in CP_UTF8

Open amyw-msft opened this issue 5 years ago • 5 comments

Environment

Microsoft Windows [Version 10.0.18363.592]

Impact

This issue also affects reading console input through the Universal C Runtime: _read, getchar, fread, scanf, etc. _cgets_s works around it only because it uses ReadConsoleW instead of ReadFile. The same problem has been reported against the UCRT on Developer Community under the title "_read() cannot read UTF-8 but _cgets_s() can".

Steps to reproduce

When using ReadFile to read from a console handle, UTF-8 input is not correctly returned. Using ReadFile on other types of handles (files, pipes) can read UTF-8 without issue. SetConsoleCP and SetConsoleOutputCP do not appear to affect this behavior.

C:\Users\stwish\source\read_utf8>type win32_test.cpp
#include <Windows.h>
#include <stdio.h>

int main()
{
    SetConsoleCP(65001);
    SetConsoleOutputCP(65001);
    const HANDLE console_stdin = GetStdHandle(STD_INPUT_HANDLE);

    const size_t buf_count = 20;
    char buffer[buf_count]{};

    DWORD num_read;

    BOOL result = ReadFile(
        console_stdin,
        buffer,
        buf_count,
        &num_read,
        nullptr
        );

    printf("ReadFile returned '%d'\n", result);
    for (int i = 0; i < 20; i++)
    {
        printf("%02x ", (unsigned char)buffer[i]);
    }

    return 0;
}
C:\Users\stwish\source\read_utf8>cl /nologo /EHsc /MT win32_test.cpp /Zi
win32_test.cpp
C:\Users\stwish\source\read_utf8>win32_test.exe
我是中文字符
ReadFile returned '1'
00 00 00 00 00 00 0d 0a 00 00 00 00 00 00 00 00 00 00 00 00
C:\Users\stwish\source\read_utf8>echo 我是中文字符 | win32_test.exe
ReadFile returned '1'
e6 88 91 e6 98 af e4 b8 ad e6 96 87 e5 ad 97 e7 ac a6 20 0d
C:\Users\stwish\source\read_utf8>type input.txt
我是中文字符

C:\Users\stwish\source\read_utf8>type input.txt | win32_test.exe
ReadFile returned '1'
e6 88 91 e6 98 af e4 b8 ad e6 96 87 e5 ad 97 e7 ac a6 00 00

Expected behavior

Running win32_test.exe and entering '我是中文字符' input on the console should return e6 88 91 e6 98 af e4 b8 ad e6 96 87 e5 ad 97 e7 ac a6 0d 0a as this is the UTF-8 representation of that string, plus CR LF.

Actual behavior

Running win32_test.exe and entering '我是中文字符' on the console returns six null bytes followed by CR LF, yet ReadFile still reports that the read succeeded.

amyw-msft avatar Feb 12 '20 18:02 amyw-msft

ReadFile and ReadConsoleA are currently limited to 7-bit ASCII when the input codepage is UTF-8 (65001) due to an assumption of 1 CHAR per WCHAR when calling WideCharToMultiByte.

An ordinal in the range [0x000000, 0x00FFFF], i.e. the Basic Multilingual Plane (BMP), uses a single WCHAR value. UTF-8 uses 1 byte per ASCII ordinal in the range [0x000000, 0x00007F]. UTF-8 uses 2-3 bytes per non-ASCII ordinal in the range [0x000080, 0x00FFFF]. A non-BMP ordinal in the range [0x010000, 0x10FFFF] uses two WCHAR values to store a UTF-16 surrogate pair. UTF-8 uses 4 bytes per non-BMP ordinal. (The maximum Unicode ordinal is capped by design at 0x10FFFF for compatibility with UTF-16, which uses a reserved 10-bit range in a pair of 16-bit codes to support a 20-bit space with an additional 16 supplementary planes, e.g. [0x010000, 0x01FFFF], [0x020000, 0x02FFFF], and so on up to [0x100000, 0x10FFFF].)
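As a rough illustration of the ranges above, the following portable C++ sketch computes how many UTF-16 code units (WCHARs) and how many UTF-8 bytes one ordinal needs. The function names utf16_units and utf8_bytes are made up for this example and are not part of any Windows API.

```cpp
#include <cstdint>

// True for values that are not encodable: beyond U+10FFFF, or inside the
// surrogate range that UTF-16 reserves for pairs.
static bool is_invalid_ordinal(std::uint32_t cp)
{
    return cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF);
}

// Number of UTF-16 code units (WCHARs on Windows) needed for one ordinal.
// Returns 0 for an invalid ordinal.
int utf16_units(std::uint32_t cp)
{
    if (is_invalid_ordinal(cp)) return 0;
    return cp < 0x10000 ? 1 : 2;  // BMP: one unit, otherwise a surrogate pair
}

// Number of UTF-8 bytes needed for one ordinal.
// Returns 0 for an invalid ordinal.
int utf8_bytes(std::uint32_t cp)
{
    if (is_invalid_ordinal(cp)) return 0;
    if (cp < 0x80) return 1;      // ASCII
    if (cp < 0x800) return 2;     // e.g. Greek, Cyrillic
    if (cp < 0x10000) return 3;   // rest of the BMP, e.g. CJK
    return 4;                     // supplementary planes
}
```

For the test string in this issue, every character is a BMP ordinal at or above 0x0800, so each one is 1 WCHAR but 3 UTF-8 bytes. That is why 6 characters become the 18 bytes shown in the piped output.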

For non-ASCII ordinals, the internal WideCharToMultiByte call fails, and the initial null byte value is used. In Windows 10, with the new console enabled, we get 1 null byte in the result per WCHAR because it encodes one code at a time. For non-BMP ordinals, given the assumption of one CHAR per WCHAR and encoding one code at a time (i.e. naive handling of surrogate pairs), we should get two null bytes per non-BMP ordinal. However, with the new console I can't even get wide-character ReadConsoleW to work with non-BMP ordinals. It translates a UTF-16 surrogate pair to the replacement character (0x00FFFD). ReadConsoleW works fine with non-BMP ordinals in the legacy console, so some change in the new console has broken UTF-16 support in cooked reads.
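The failure mode described above can be simulated without any Windows API. This is not the conhost source, only a sketch of the stated assumption: the output buffer is pre-zeroed, each WCHAR is encoded on its own, and each one is given room for exactly one byte. encode_unit and one_byte_per_unit are hypothetical names.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a per-WCHAR conversion: encodes one UTF-16 code
// unit as UTF-8 into `out`, which has room for `capacity` bytes. Returns the
// number of bytes written, or 0 if the result does not fit or the unit is a
// surrogate half (which cannot be encoded on its own).
std::size_t encode_unit(char16_t unit, unsigned char* out, std::size_t capacity)
{
    if (unit >= 0xD800 && unit <= 0xDFFF) return 0;
    const std::size_t needed = unit < 0x80 ? 1 : unit < 0x800 ? 2 : 3;
    if (needed > capacity) return 0;  // buffer too small: nothing is written
    if (needed == 1) {
        out[0] = static_cast<unsigned char>(unit);
    } else if (needed == 2) {
        out[0] = static_cast<unsigned char>(0xC0 | (unit >> 6));
        out[1] = static_cast<unsigned char>(0x80 | (unit & 0x3F));
    } else {
        out[0] = static_cast<unsigned char>(0xE0 | (unit >> 12));
        out[1] = static_cast<unsigned char>(0x80 | ((unit >> 6) & 0x3F));
        out[2] = static_cast<unsigned char>(0x80 | (unit & 0x3F));
    }
    return needed;
}

// Simulates the assumption of 1 CHAR per WCHAR: the result has one byte per
// input code unit, and a failed conversion leaves the initial null byte.
std::vector<unsigned char> one_byte_per_unit(const std::u16string& input)
{
    std::vector<unsigned char> result(input.size(), 0);
    for (std::size_t i = 0; i < input.size(); ++i)
        encode_unit(input[i], &result[i], 1);
    return result;
}
```

Under these assumptions, the six CJK characters plus CR LF produce six null bytes followed by 0d 0a, and "I8Σπ q" produces 49 38 00 00 20 71, matching the outputs reported in this thread. A surrogate pair produces two null bytes, one per WCHAR.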

For ReadFile and ReadConsoleA in older versions of Windows, or if we enable the legacy console in Windows 10, using UTF-8 as the input codepage causes a 'successful' read of 0 bytes if the read contains even one non-ASCII ordinal. Many programs interpret a successful read of 0 bytes as EOF.

eryksun avatar Feb 13 '20 00:02 eryksun

@eryksun thanks for the detailed write-up, and @stwish-msft thanks for the report. It looks like we don't actually have a bug tracking COOKED_READ not supporting UTF-8, but we do know about it. Perhaps it got lost in the transition out of Azure DevOps?

Regardless, this is now the one.

Eryk, would you mind filing a separate issue for the non-BMP ReadConsoleW bug? That one is likely more readily fixable than this one, and since it's a regression from legacy (and I've got a copy of every shipped version of conhost, so we can find out exactly when :P) it's pretty important.

DHowett-MSFT avatar Feb 13 '20 08:02 DHowett-MSFT

I'm giving this one the unusual denomination of "bugtask". We have a couple of them -- it's a bug, yes, but it's a fairly big chunk of work and a new feature to boot. :smile:

DHowett-MSFT avatar Feb 13 '20 08:02 DHowett-MSFT

Hi, I'm assuming that by "cooked" you mean that the following are enabled in the console mode on stdin: ENABLE_ECHO_INPUT, ENABLE_LINE_INPUT, and ENABLE_PROCESSED_INPUT. Is that right?

For me, the problem occurs even with those disabled.

Here is my test code:

#include <stdio.h>
#include <windows.h>

int main(void)
{
    // Set UTF-8 code page for input and output
    if (!SetConsoleOutputCP(65001))
        return 1;

    if (!SetConsoleCP(65001))
        return 2;

    printf("Output code page is %d, input code page is %d\n",
        (int)GetConsoleOutputCP(), (int)GetConsoleCP());

    puts(u8"We can output utf-8: γατάκι");

    HANDLE hStdin = GetStdHandle(STD_INPUT_HANDLE);
    if (hStdin == INVALID_HANDLE_VALUE || hStdin == NULL)
        return 3;

    // Set input mode
    DWORD mode;
    if (!GetConsoleMode(hStdin, &mode))
        return 4;
    mode &= ~(ENABLE_ECHO_INPUT | ENABLE_LINE_INPUT | ENABLE_PROCESSED_INPUT);
    if (!SetConsoleMode(hStdin, mode))
        return 5;
    if (!GetConsoleMode(hStdin, &mode))
        return 6;
    printf("Console mode for stdin is 0x%08x\n", (unsigned)mode);  // %x expects an unsigned value

    puts("Input is now in 'raw' mode.  Type something.");

    char b;
    do {
        DWORD numRead;
        if (!ReadFile(hStdin, &b, 1, &numRead, NULL) || numRead == 0)
            return 7;
        printf("%02x ", (int)b & 0x0ff);
    } while (b != 'q');

    return 0;
}

If I run this in Windows Terminal running cmd.exe and paste the text "I8Σπ q" when prompted, the output looks like this:

C:\code\c\vt_experiments>test
Output code page is 65001, input code page is 65001
We can output utf-8: γατάκι
Console mode for stdin is 0x000001f0
Input is now in 'raw' mode.  Type something.
49 38 00 00 20 71
C:\code\c\vt_experiments>

You can see that Σ and π are read as zeros.

Conhost running cmd.exe is the same except that the reported console mode is 0x1b0.

Some of the set flags in the console modes 0x1f0 or 0x1b0 seem to be undocumented, or am I reading them wrong? Console mode flags reference: https://docs.microsoft.com/en-us/windows/console/high-level-console-modes
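To make the two mode values easier to read, here is a small portable sketch that names the bits. The flag values are copied from the Windows SDK console headers as I understand them, so treat the table as an assumption to check against your own SDK. ModeFlag and describe_input_mode are hypothetical names.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Input-mode flag values, repeated here so the sketch compiles without
// <windows.h>. Assumed to match the definitions in the Windows SDK headers.
struct ModeFlag { std::uint32_t value; const char* name; };

static const ModeFlag kInputModeFlags[] = {
    {0x0001, "ENABLE_PROCESSED_INPUT"},
    {0x0002, "ENABLE_LINE_INPUT"},
    {0x0004, "ENABLE_ECHO_INPUT"},
    {0x0008, "ENABLE_WINDOW_INPUT"},
    {0x0010, "ENABLE_MOUSE_INPUT"},
    {0x0020, "ENABLE_INSERT_MODE"},
    {0x0040, "ENABLE_QUICK_EDIT_MODE"},
    {0x0080, "ENABLE_EXTENDED_FLAGS"},
    {0x0100, "ENABLE_AUTO_POSITION"},
    {0x0200, "ENABLE_VIRTUAL_TERMINAL_INPUT"},
};

// Returns the names of the flags set in `mode`, separated by " | ".
// Bits that are not in the table are appended as a hexadecimal remainder.
std::string describe_input_mode(std::uint32_t mode)
{
    std::string text;
    for (const ModeFlag& flag : kInputModeFlags) {
        if (mode & flag.value) {
            if (!text.empty()) text += " | ";
            text += flag.name;
            mode &= ~flag.value;  // clear the bit once it has been named
        }
    }
    if (mode != 0) {
        char buf[16];
        std::snprintf(buf, sizeof buf, "0x%x", static_cast<unsigned>(mode));
        if (!text.empty()) text += " | ";
        text += buf;
    }
    return text;
}
```

With this table, 0x1f0 decodes to ENABLE_MOUSE_INPUT | ENABLE_INSERT_MODE | ENABLE_QUICK_EDIT_MODE | ENABLE_EXTENDED_FLAGS | ENABLE_AUTO_POSITION, and 0x1b0 is the same set without ENABLE_QUICK_EDIT_MODE. The 0x100 bit (ENABLE_AUTO_POSITION) is, as far as I can tell, defined in the headers but not described on the linked reference page, which may be the flag that looks undocumented.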

Instead of ReadFile() I've also tried getchar(), scanf_s(), fgetc() - none of them worked either.

I'm new to console programming, so sorry if I've overlooked something that should be obvious.

Thanks!

Microsoft Visual Studio Community 2019 Version 16.8.3
Microsoft Windows [Version 10.0.19041.685]
Windows Terminal Version: 1.4.3243.0

clinton-r avatar Dec 22 '20 06:12 clinton-r

It's about 2 years later now and this seems to be a tough problem. One reason why Windows Console sucks is that it cannot support UTF-8 input natively, and requires developers to use the annoying ReadConsoleW and to reinvent a complex UTF-16 to UTF-8 translation state machine once they encounter this bug. Sad.

defrag257 avatar Dec 15 '21 13:12 defrag257
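For readers who end up writing the UTF-16 to UTF-8 translation mentioned in the last comment, here is a minimal portable sketch of such a state machine. It assumes the caller obtains UTF-16 code units in chunks (for example from ReadConsoleW, which is not called here) and that a surrogate pair may be split across two chunks. The class name Utf16ToUtf8 and its methods are hypothetical. Unpaired surrogates are replaced with U+FFFD, which is one possible policy rather than the only one.

```cpp
#include <cstddef>
#include <string>

class Utf16ToUtf8
{
public:
    // Converts `count` UTF-16 code units and appends the UTF-8 bytes to `out`.
    // A high surrogate at the end of a chunk is held back until the next call.
    void feed(const char16_t* units, std::size_t count, std::string& out)
    {
        for (std::size_t i = 0; i < count; ++i) {
            const char16_t u = units[i];
            if (pending_high_ != 0) {
                const char16_t high = pending_high_;
                pending_high_ = 0;
                if (u >= 0xDC00 && u <= 0xDFFF) {
                    // Complete pair: combine both halves into one ordinal.
                    const char32_t cp = 0x10000
                        + ((static_cast<char32_t>(high) - 0xD800) << 10)
                        + (static_cast<char32_t>(u) - 0xDC00);
                    append(cp, out);
                    continue;
                }
                append(0xFFFD, out);  // the held high surrogate had no partner
            }
            if (u >= 0xD800 && u <= 0xDBFF)
                pending_high_ = u;        // wait for the low half
            else if (u >= 0xDC00 && u <= 0xDFFF)
                append(0xFFFD, out);      // low half without a high half
            else
                append(u, out);           // ordinary BMP code unit
        }
    }

    // Call at end of input so a dangling high surrogate is not lost silently.
    void finish(std::string& out)
    {
        if (pending_high_ != 0) {
            pending_high_ = 0;
            append(0xFFFD, out);
        }
    }

private:
    // Appends the UTF-8 encoding of one valid ordinal.
    static void append(char32_t cp, std::string& out)
    {
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }

    char16_t pending_high_ = 0;  // 0 means no high surrogate is being held
};
```

On Windows, a simpler route for whole buffers is to call WideCharToMultiByte with CP_UTF8 on the text returned by ReadConsoleW. The hand-written state only becomes necessary when a read can end between the two halves of a surrogate pair.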