WriteConsoleOutputW doesn't work with wide chars and surrogate pairs in Windows Terminal

Open BDisp opened this issue 7 months ago • 11 comments

Windows Terminal version

1.22.11141.0

Windows build number

10.0.26100.0

Other Software

cmd.exe, conhost.exe and Windows Terminal

Steps to reproduce

Run this code in cmd or conhost and compare the result with Windows Terminal. The output is different in Windows Terminal.

#include <windows.h>
#include <iostream>

int main() {
    // 1) Enable UTF-8 output so Unicode glyphs display correctly
    if (!SetConsoleOutputCP(CP_UTF8)) {
        std::cerr << "SetConsoleOutputCP failed\n";
        return 1;
    }

    // 2) Get handle and current buffer info
    HANDLE hConsole = GetStdHandle(STD_OUTPUT_HANDLE);
    CONSOLE_SCREEN_BUFFER_INFO csbi;
    if (!GetConsoleScreenBufferInfo(hConsole, &csbi)) {
        std::cerr << "GetConsoleScreenBufferInfo failed\n";
        return 1;
    }

    SHORT cursorX = csbi.dwCursorPosition.X;
    SHORT cursorY = csbi.dwCursorPosition.Y + 1;

    // 3) Prepare 2×4 CHAR_INFO buffer for our two lines
    const SHORT width = 4, height = 2;
    CHAR_INFO buffer[height][width] = {};

    // Line 1: '糊' · '1'
    buffer[0][0].Char.UnicodeChar = L'\u7CCA';
    buffer[0][1].Char.UnicodeChar = L'\0';
    buffer[0][2].Char.UnicodeChar = L' ';
    buffer[0][3].Char.UnicodeChar = L'1';

    // Line 2: 👨 (surrogate pair) · ' ' · '2'
    buffer[1][0].Char.UnicodeChar = 0xD83D;  // High-surrogate
    buffer[1][1].Char.UnicodeChar = 0xDC68;  // Low-surrogate
    buffer[1][2].Char.UnicodeChar = L' ';
    buffer[1][3].Char.UnicodeChar = L'2';

    // Fill attributes (white on default background)
    WORD white = FOREGROUND_RED | FOREGROUND_GREEN | FOREGROUND_BLUE;
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            buffer[y][x].Attributes = white;

    // 4) Define output region: width=4, height=2, at (cursorX, cursorY-1)
    SMALL_RECT writeRegion = {
        cursorX,
        static_cast<SHORT>(cursorY - 1),
        static_cast<SHORT>(cursorX + width - 1),
        static_cast<SHORT>(cursorY - 2 + height)
    };
    COORD bufSize = { width, height };
    COORD bufCoord = { 0, 0 };

    // 5) Write our 2×4 block into the console buffer
    if (!WriteConsoleOutputW(
        hConsole,
        reinterpret_cast<CHAR_INFO*>(buffer),
        bufSize,
        bufCoord,
        &writeRegion
    )) {
        std::cerr << "WriteConsoleOutputW failed\n";
        return 1;
    }

    // 6) Compute the new cursor position:
    COORD newPos;
    newPos.X = 0;                                // or wherever you want the prompt to start
    newPos.Y = static_cast<SHORT>(cursorY + 1); // two lines below the original cursor, just past the block

    // 7) Move the cursor there:
    SetConsoleCursorPosition(hConsole, newPos);

    // 8) Now return. The next thing you see will be the shell prompt at newPos.
    return 0;
}

Expected Behavior

Correct with cmd and conhost:

Image

Actual Behavior

Wrong with Windows Terminal:

Image

In the first line it adds a space after the 糊, and in the second line it outputs two replacement characters. It doesn't check whether the first char is a high surrogate that could start a surrogate pair.

BDisp avatar Apr 26 '25 23:04 BDisp

We've found some similar issues:

  • #10287 , similarity score: 84%
  • #10810 , similarity score: 84%
  • #10055 , similarity score: 83%

If any of the above are duplicates, please consider closing this issue out and adding additional context in the original issue.

Note: You can give me feedback by 👍 or 👎 this comment.

similar-issues-ai[bot] avatar Apr 26 '25 23:04 similar-issues-ai[bot]

This is related to WriteConsoleOutputW; the other open issue is related to ReadConsoleOutputW.

BDisp avatar Apr 26 '25 23:04 BDisp

For row 1: Inserting half of a wide character into a single cell is unspecified. You are not supposed to do that. Well-formed applications do not do so. :)

DHowett avatar Apr 26 '25 23:04 DHowett

That is just for legacy terminals, which can interpret a null char in a cell as an empty string, letting the wide char on the left be output in full.

BDisp avatar Apr 26 '25 23:04 BDisp

But you aren't inserting a null character. You're inserting a space!

Anyway, WriteConsoleOutputCharacter[AW] is for legacy consoles that only support UCS-2 (not surrogate pairs) anyway, so it is fitting that the issues it has are issues for legacy consoles. 🙂

DHowett avatar Apr 27 '25 00:04 DHowett

But you aren't inserting a null character. You're inserting a space!

Sorry, I already changed the code. The result is the same.

Anyway, WriteConsoleOutputCharacter[AW] is for legacy consoles that only support UCS-2 (not surrogate pairs) anyway, so it is fitting that the issues it has are issues for legacy consoles. 🙂

But the legacy consoles are working as expected. 😃

BDisp avatar Apr 27 '25 00:04 BDisp

I'm the person who advocated for and made the change that broke your code. This happened in #17510.

As you may already know, historically the console on Windows was essentially nothing but a

CHAR_INFO buffer[height][width];

with a GDI window attached to it. The way WriteConsoleOutput worked, was that it literally just iterated over the given area and copied your given CHAR_INFOs over. It didn't even do any input validation beyond that. This allowed you to smuggle hidden data into the console buffer that the user couldn't see, among others. (For what it's worth, this is not a security issue IMO. I mention it as it highlights how "direct" the console APIs were.)
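To illustrate how direct that model was, here is a minimal, portable sketch of a cell matrix and a rectangle copy. `Cell`, `Screen` and `blit` are hypothetical names invented for this illustration (this is not the conhost implementation); `Cell` only mimics the shape of CHAR_INFO. The copy never looks at the characters, so the two halves of a surrogate pair simply land in two neighbouring cells.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for CHAR_INFO: one UTF-16 code unit plus attributes.
struct Cell {
    char16_t ch = u' ';
    std::uint16_t attr = 0;
};

// Hypothetical screen buffer: a fixed width x height matrix of cells.
struct Screen {
    int width;
    int height;
    std::vector<Cell> cells; // row-major, width * height entries

    Screen(int w, int h)
        : width(w), height(h), cells(static_cast<std::size_t>(w) * h) {}

    Cell& at(int x, int y) {
        return cells[static_cast<std::size_t>(y) * width + x];
    }
};

// Copy a srcW x srcH block of cells to (left, top), clipping at the edges.
// Nothing here validates or interprets the characters being copied.
void blit(Screen& screen, const std::vector<Cell>& src,
          int srcW, int srcH, int left, int top) {
    for (int y = 0; y < srcH; ++y) {
        for (int x = 0; x < srcW; ++x) {
            const int dx = left + x;
            const int dy = top + y;
            if (dx < 0 || dy < 0 || dx >= screen.width || dy >= screen.height)
                continue;
            screen.at(dx, dy) = src[static_cast<std::size_t>(y) * srcW + x];
        }
    }
}
```

Writing a two-cell source holding 0xD83D and 0xDC68 with `blit` stores each code unit in its own column, which is exactly the situation the rest of this comment describes.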

This API design fundamentally conflicts with modern Unicode, with its complex languages, combining marks, or emojis. Newer versions of both Windows Terminal and the old console store text like this (figuratively speaking):

std::wstring text[height]; // each row can be an arbitrarily long string
WORD attributes[height][width];

There's no matrix of cells anymore and in fact it can't have one: Those combining marks and emojis can easily rack up 20+ characters for a single cell. This makes dynamic width rows necessary.
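A rough, portable sketch of what such a dynamic-width row could look like follows. The `Row` type, its `columnStart` lookup and `appendCluster` are assumptions made up for this example; the comment above only describes the storage "figuratively", so this should not be read as the actual Windows Terminal data structure.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical row: text of arbitrary length plus per-column metadata.
struct Row {
    std::u16string text;                   // every UTF-16 code unit in the row
    std::vector<std::size_t> columnStart;  // columnStart[c]: offset in `text` of the cluster shown in column c
    std::vector<std::uint16_t> attributes; // one attribute per column
};

// Append one cluster (any number of code units) that occupies `columns` columns.
void appendCluster(Row& row, const std::u16string& cluster,
                   int columns, std::uint16_t attr) {
    const std::size_t offset = row.text.size();
    row.text += cluster;
    for (int i = 0; i < columns; ++i) {
        row.columnStart.push_back(offset);
        row.attributes.push_back(attr);
    }
}
```

With this layout, an emoji stored as a surrogate pair adds two code units to `text` and two columns that both point at the same offset, so the number of code units and the number of columns are free to differ.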

But even just surrogate pairs already break the CHAR_INFO model. You can't store a surrogate pair in a single column with these APIs. You also can't read or write them using non-CHAR_INFO-APIs if you did store them with 2 CHAR_INFOs. The entire Console API has always been inconsistently broken when it came to anything beyond UCS2.

This is why I have continuously worked towards a future where all CHAR_INFO APIs strictly support only UCS2 (no surrogate pairs) or DBCS (ShiftJIS, etc.), just like how it worked up until roughly Windows 10 1607. #17510 is just one more step towards this and in the future your code will also stop working in the old console window. To write surrogate pairs you must use non-CHAR_INFO APIs.
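The arithmetic behind that limitation can be sketched in a few lines of portable C++. The helper names below are hypothetical; the constants are the standard UTF-16 surrogate ranges. Any code point above U+FFFF needs two 16-bit code units, so it cannot fit in the single 16-bit character slot of one CHAR_INFO.

```cpp
#include <cstdint>

// Standard UTF-16 surrogate ranges.
constexpr bool isHighSurrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
constexpr bool isLowSurrogate(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Combine a valid high/low surrogate pair into a single code point.
constexpr char32_t combineSurrogates(char16_t high, char16_t low) {
    return 0x10000
         + ((static_cast<char32_t>(high) - 0xD800) << 10)
         + (static_cast<char32_t>(low) - 0xDC00);
}

// True if the code point can be stored in one 16-bit code unit (UCS-2).
constexpr bool fitsInUcs2(char32_t cp) {
    return cp <= 0xFFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}
```

For the values used in this issue, `combineSurrogates(0xD83D, 0xDC68)` yields U+1F468 (👨), which does not fit in UCS-2, while U+7CCA (糊) does.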

You may say that I'm breaking compatibility with existing applications, and you'd be correct. I'm carefully breaking some situations so that Windows overall can move towards a better future: a future with more features (full grapheme cluster support in 1.22), better performance (~25x at this point!) and fewer bugs (can't even count them anymore). Arguably that's one of the largest user asks for Windows and has been for a long time. Strictly speaking, it also wouldn't break "legacy" applications per se, because those predate the addition of surrogate pairs to the console IMO.

lhecker avatar Apr 27 '25 16:04 lhecker

Leonard wrote:

and in the future your code will also stop working in the old console window

@BDisp If for some reason you need such functionality, you could play with vtm (it's still under development, but just in case). Run vtm -r term to run its built-in terminal.

It can output independent cells, even if they are halves of wide characters (discussed in #4345). In vtm, your code will output half of the hieroglyph, but with a minor edit:

    // Line 1: '糊' · '1'
    buffer[0][0].Char.UnicodeChar = L'\u7CCA';
    buffer[0][1].Char.UnicodeChar = L'\u7CCA';
    buffer[0][2].Char.UnicodeChar = L' ';
    buffer[0][3].Char.UnicodeChar = L'1';

you will get the full character: Image

o-sdn-o avatar Apr 27 '25 21:04 o-sdn-o

Including the same wide character in both halves is what is required for this to operate as expected on all versions of the Windows Console as well.
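As a small, portable illustration of that rule, the sketch below lays out a row of cells and repeats each wide character in both of its columns. `columnsFor` is a deliberately simplified, hypothetical width rule (it only treats the CJK Unified Ideographs block U+4E00 to U+9FFF as two columns wide); a real application would use a proper wcwidth-style table, and the resulting code units would then be copied into CHAR_INFO entries.

```cpp
#include <string>
#include <vector>

// Simplified width rule for this sketch only: CJK Unified Ideographs
// take two columns, everything else takes one.
int columnsFor(char16_t ch) {
    return (ch >= 0x4E00 && ch <= 0x9FFF) ? 2 : 1;
}

// Produce one code unit per column, repeating a wide character in both
// of its columns. Only BMP characters are handled here.
std::vector<char16_t> layoutRow(const std::u16string& text) {
    std::vector<char16_t> cells;
    for (char16_t ch : text) {
        const int columns = columnsFor(ch);
        for (int i = 0; i < columns; ++i)
            cells.push_back(ch);
    }
    return cells;
}
```

For the first line of the repro, `layoutRow(u"\u7CCA 1")` produces four cells: U+7CCA, U+7CCA, a space and '1'.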

DHowett avatar Apr 27 '25 22:04 DHowett

I really appreciate all your feedback, and I understand that new features can cause breaking changes. Thus, for cases where virtual terminal sequences are disabled, I decided to use WriteConsoleW and WriteConsoleOutputAttribute, which work perfectly with cmd, conhost and Windows Terminal. I know that it's limited to 16 colors, but it supports wide chars and surrogate pairs. Do you want me to close this issue? Thanks.

BDisp avatar Apr 27 '25 22:04 BDisp

(To be clear, I meant my comment about "your code will also stop working in the old console window" specifically regarding surrogate pairs. Most wide characters are in the BMP and thus regular UCS2.)

WriteConsoleOutputAttribute is actually also a CHAR_INFO-related API. For instance, it technically allows you to colorize halves of wide cells (we just don't support that right now as a technical limitation). Because of this, I actually expected it to not work for colorizing surrogate pairs in Windows Terminal 1.22. Does that really work for you?

If you don't mind, can you tell us more about the software you're developing and what versions of Windows you're targeting? My understanding was that proper support for surrogate pairs was added to conhost (the console) only after it got initial support for VT sequences in Windows 10 10586 (in 2015). This would mean that either CHAR_INFOs are fine to use as-is, because you don't need surrogate pairs, or that you can use VT sequences together with surrogate pairs.

I wouldn't close this issue just yet. Even though I wrote all the above, my goal is still to give you a proper solution. 🙂

lhecker avatar Apr 28 '25 00:04 lhecker

Thanks for all your feedback. WriteConsoleOutputAttribute really doesn't work. I wasn't testing with colors, only with the console's default attributes. Setting custom foreground colors still works, but when background colors are also set, they are only printed on the first line and Windows Terminal doesn't persist them. Thus, the workaround is to use SetConsoleTextAttribute, which does what is expected and works fine with wide chars and surrogate pairs in all consoles. Below is the output and my changed code that makes all of this work. I hope at least this doesn't break in the future. I know that using VT sequences works better, but my solution is for consoles that don't use VT sequences, and it handles wide chars and surrogate pairs with 16 colors.

Here is the output:

  • cmd and conhost:

Image

  • Windows Terminal:

Image

Here is my current code:

#include <windows.h>
#include <algorithm>
#include <iostream>
#include <vector>
#include <string>
#include "NativeExports.h"

bool IsVirtualTerminalEnabled(HANDLE hConsole) {
	DWORD mode = 0;
	if (!GetConsoleMode(hConsole, &mode))
		return false; // Handle not valid or error

	return (mode & ENABLE_VIRTUAL_TERMINAL_PROCESSING) != 0;
}

/// Ensures the console screen buffer is at least (minCols × minRows).
/// Returns true on success, false on failure.
bool EnsureBufferSize(HANDLE hConsole, SHORT minCols, SHORT minRows) {
	CONSOLE_SCREEN_BUFFER_INFO csbi;
	if (!GetConsoleScreenBufferInfo(hConsole, &csbi)) return false;

	// If buffer is already big enough, nothing to do
	if (csbi.dwSize.X >= minCols && csbi.dwSize.Y >= minRows)
		return true;

	// Compute new size
	COORD newSize = csbi.dwSize;
	newSize.X = std::max<SHORT>(newSize.X, minCols);
	newSize.Y = std::max<SHORT>(newSize.Y, minRows);

	// Grow the buffer
	if (!SetConsoleScreenBufferSize(hConsole, newSize))
		return false;
	return SetConsoleCursorPosition(hConsole, { 0, static_cast<SHORT>(newSize.Y) });
}

const int H = 3; // Height of the buffer
const int W = 4; // Width of the buffer

// Represents a contiguous run of characters sharing the same attribute
struct Run {
	WORD attr;          // Color attribute
	std::wstring text;  // UTF-16 text
};

// Splits a single row of CHAR_INFO cells into runs
std::vector<Run> BuildRunsForRow(const CHAR_INFO* rowBuf, int width) {
	std::vector<Run> runs;
	if (width <= 0) return runs;

	int startCol = 0;
	while (startCol < width && rowBuf[startCol].Char.UnicodeChar == L'\0') {
		++startCol;
	}
	if (startCol >= width) {
		runs.push_back({ 0, L"\n" });
		return runs;
	}

	Run current{ rowBuf[startCol].Attributes, L"" };
	int cellCount = 0;

	for (int c = startCol; c < width; ++c) {
		wchar_t ch = rowBuf[c].Char.UnicodeChar;
		if (ch == L'\0') continue;  // skip flags

		// Check for surrogate pairs
		if (0xD800 <= ch && ch <= 0xDBFF) { // High surrogate
			if ((c + 1) < width) {
				wchar_t ch2 = rowBuf[c + 1].Char.UnicodeChar;
				if (0xDC00 <= ch2 && ch2 <= 0xDFFF) { // Low surrogate
					if (rowBuf[c].Attributes != current.attr) {
						runs.push_back(current);
						current = { rowBuf[c].Attributes, L"" };
					}
					int codepoint = ((ch - 0xD800) << 10)
						+ (ch2 - 0xDC00) + 0x10000;
					// wcwidth on UTF-32 code point? fallback to wcwidth on ch,low
					int w = GetWidth(codepoint);
					cellCount += (w < 0 ? 1 : w);
					// append both code units to text
					current.text.push_back(ch);
					current.text.push_back(ch2);
					++c; // Skip the next cell
					continue;
				}
			}
		}

		// normal char
		if (rowBuf[c].Attributes != current.attr) {
			runs.push_back(current);
			current = { rowBuf[c].Attributes, L"" };
		}
		current.text.push_back(ch);
		// Normal BMP character:
		int w = GetWidth(ch);	// 0,1,2 or -1
		if (w < 0) w = 1;				// fallback so nothing disappears
		cellCount += w;
	}

	// if we consumed fewer cells than width, pad with spaces
	while (cellCount < width) {
		current.text.push_back(L' ');
		++cellCount;
	}

	runs.push_back(current);
	runs.push_back({ 0, L"\n" });
	return runs;
}

// Example usage for all rows in an H×W buffer
void WriteBufferToConsole(CHAR_INFO buf[H][W], HANDLE hConsole, SHORT cursorX, SHORT cursorY) {
	for (int r = 0; r < H; ++r) {
		auto runs = BuildRunsForRow(buf[r], W);
		COORD pos = { cursorX, static_cast<SHORT>(cursorY + r) };
		SetConsoleCursorPosition(hConsole, pos);

		DWORD written;
		for (auto& run : runs) {
			SetConsoleTextAttribute(hConsole, run.attr);
			WriteConsoleW(
				hConsole,
				run.text.c_str(),
				static_cast<DWORD>(run.text.size()),
				&written,
				nullptr
			);
		}
	}
}

int main() {
	// 1) Enable UTF-8 for Unicode
	SetConsoleOutputCP(CP_UTF8);

	// 2) Get console handle & cursor
	HANDLE hConsole = GetStdHandle(STD_OUTPUT_HANDLE);

	if (IsVirtualTerminalEnabled(hConsole)) {
		std::cout << "VT sequences are ENABLED.\n";
	}
	else {
		std::cout << "VT sequences are DISABLED.\n";
	}

	CONSOLE_SCREEN_BUFFER_INFO csbi;
	GetConsoleScreenBufferInfo(hConsole, &csbi);

	// Store the original text attributes
	WORD originalAttributes = csbi.wAttributes;
	// Store the original cursor position
	SHORT cursorX = csbi.dwCursorPosition.X;
	SHORT cursorY = csbi.dwCursorPosition.Y;

	// 3) Master CHAR_INFO buffer (3 rows × 4 cols)
	CHAR_INFO buf[H][W] = {};

	// Line 1: 糊 1
	buf[0][0].Char.UnicodeChar = L'\u7CCA'; // 糊 occupies 2 columns
	buf[0][1].Char.UnicodeChar = L'\0';     // flag to skip in Char-path only
	buf[0][2].Char.UnicodeChar = L' ';
	buf[0][3].Char.UnicodeChar = L'1';

	// Line 2: 👨 2
	buf[1][0].Char.UnicodeChar = 0xD83D;    // High surrogate (👨) occupies 2 columns
	buf[1][1].Char.UnicodeChar = 0xDC68;    // Low surrogate
	buf[1][2].Char.UnicodeChar = L' ';
	buf[1][3].Char.UnicodeChar = L'2';

	// Line 3: 𝔽 3
	buf[2][0].Char.UnicodeChar = 0xD835;    // High surrogate (𝔽) occupies 1 column
	buf[2][1].Char.UnicodeChar = 0xDD3D;    // Low surrogate
	buf[2][2].Char.UnicodeChar = L' ';
	buf[2][3].Char.UnicodeChar = L'3';

	// Foreground color per cell
	WORD fgColors[H][W] = {
		{ FOREGROUND_RED | FOREGROUND_INTENSITY, FOREGROUND_GREEN | FOREGROUND_INTENSITY, FOREGROUND_BLUE | FOREGROUND_INTENSITY, FOREGROUND_RED | FOREGROUND_GREEN },
		{ FOREGROUND_GREEN | FOREGROUND_BLUE, FOREGROUND_RED | FOREGROUND_BLUE, FOREGROUND_RED | FOREGROUND_GREEN | FOREGROUND_BLUE, FOREGROUND_RED },
		{ FOREGROUND_GREEN, FOREGROUND_BLUE, FOREGROUND_RED | FOREGROUND_GREEN, FOREGROUND_GREEN | FOREGROUND_BLUE }
	};

	// Background color per cell
	WORD bgColors[H][W] = {
		{ BACKGROUND_BLUE, BACKGROUND_RED, BACKGROUND_GREEN, BACKGROUND_BLUE | BACKGROUND_RED },
		{ BACKGROUND_GREEN | BACKGROUND_BLUE, BACKGROUND_RED | BACKGROUND_BLUE, BACKGROUND_RED | BACKGROUND_GREEN | BACKGROUND_BLUE, BACKGROUND_RED },
		{ BACKGROUND_GREEN, BACKGROUND_BLUE, BACKGROUND_RED | BACKGROUND_GREEN, BACKGROUND_GREEN | BACKGROUND_BLUE }
	};

	// Assign attributes with different foreground and background colors
	for (int r = 0; r < H; ++r) {
		for (int c = 0; c < W; ++c) {
			// Simple fix: if foreground and background are the same color, toggle the foreground intensity bit so the text stays visible
			if ((fgColors[r][c] & 0x0F) == (bgColors[r][c] >> 4)) {
				fgColors[r][c] ^= FOREGROUND_INTENSITY;
			}
			buf[r][c].Attributes = fgColors[r][c] | bgColors[r][c];
		}
	}

	// Make sure the buffer can hold (cursorX + W) columns and (cursorY + H + 2) rows:
	EnsureBufferSize(hConsole,
		static_cast<SHORT>(cursorX + W),
		static_cast<SHORT>(cursorY + H + 2));

	WriteBufferToConsole(buf, hConsole, cursorX, cursorY);

	SetConsoleCursorPosition(hConsole, { 0, static_cast<SHORT>(cursorY + H) });

	// Restore the original text attributes
	SetConsoleTextAttribute(hConsole, originalAttributes);

	return 0;
}

Edit: I forgot to say that the code for the GetWidth function can be found at https://github.com/BDisp/WcwidthWrapper, thanks.

BDisp avatar Apr 29 '25 03:04 BDisp

	if (IsVirtualTerminalEnabled(hConsole)) {
		std::cout << "VT sequences are ENABLED.\n";
	}
	else {
		std::cout << "VT sequences are DISABLED.\n";
	}

You currently only check if VT is enabled, but you could also enable it yourself with SetConsoleMode. Is there a reason you don't do that? You could then always use VT sequences, even under the older conhost.
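A minimal sketch of that suggestion, assuming the process writes to standard output: read the current mode with GetConsoleMode, then add ENABLE_VIRTUAL_TERMINAL_PROCESSING with SetConsoleMode. `TryEnableVirtualTerminal` is a hypothetical helper name, and the non-Windows branch (which simply reports success) is an assumption added only so the snippet compiles on other platforms.

```cpp
#ifdef _WIN32
#include <windows.h>
#endif

// Tries to turn on VT processing for standard output.
// Returns true if VT sequences can be used afterwards.
bool TryEnableVirtualTerminal() {
#ifdef _WIN32
    HANDLE hOut = GetStdHandle(STD_OUTPUT_HANDLE);
    if (hOut == INVALID_HANDLE_VALUE || hOut == nullptr)
        return false;

    DWORD mode = 0;
    if (!GetConsoleMode(hOut, &mode))
        return false; // not a console (for example, redirected output)

    if (mode & ENABLE_VIRTUAL_TERMINAL_PROCESSING)
        return true;  // already enabled

    // Fails on consoles that predate VT support.
    return SetConsoleMode(hOut, mode | ENABLE_VIRTUAL_TERMINAL_PROCESSING) != 0;
#else
    // Assumption for this sketch: terminals on other platforms
    // interpret VT sequences without any opt-in.
    return true;
#endif
}
```

A caller could fall back to the SetConsoleTextAttribute and WriteConsoleW path only when this function returns false.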

lhecker avatar Apr 29 '25 19:04 lhecker

Is there a reason you don't simply enable VT support with SetConsoleMode and then always use VT sequences, even under the older conhost?

As I said before, my intention isn't to force VT support but to work with the current console configuration. Windows Terminal always has VT support activated, so there is no problem dealing with VT sequences there. This is more for cases where we want to run without VT support, or with a remote terminal via SSH, low console resources, etc. I don't think it's a big problem to leave some legacy APIs working without VT support. The code above works great without VT support. I recognize that it isn't justified to spend time on WriteConsoleOutputW, ReadConsoleOutputW, etc. But it would be good to leave what works well now as is, and let users manage their code so that it works minimally. So, I have no problem enabling VT support, and I like it very much. My concern wasn't handling VT support in my code, only displaying whether VT sequences are enabled or disabled.

BDisp avatar Apr 29 '25 20:04 BDisp

I understand and won't pry any further. I'll close the issue for now in that case. Please let us know if you encounter any other issues! Also, please always feel free to use our discussions section: https://github.com/microsoft/terminal/discussions

However, I'd still like to clarify that VT sequences are always supported via SSH. More importantly though, they also require significantly less console resources, counter to what you said (both less memory and less CPU). I strongly recommend using them exclusively in any future applications you may want to create. 🙂
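For code that already carries 16-color attribute words, moving to VT output mostly means translating each attribute into an SGR escape sequence. The sketch below is one possible mapping, not something taken from this thread: `attributeToSgr` and `toAnsiIndex` are hypothetical helpers, and the bit layout assumed is the classic console one (blue = 1, green = 2, red = 4, intensity = 8 for the foreground, the same bits shifted left by four for the background). The bright variants use SGR codes 90 to 97 and 100 to 107.

```cpp
#include <cstdint>
#include <string>

// Console color bits are ordered blue=1, green=2, red=4, while ANSI
// color indices are ordered red=1, green=2, blue=4. Swap red and blue.
constexpr int toAnsiIndex(unsigned rgbBits) {
    return static_cast<int>(((rgbBits & 0x4) >> 2)
                          | (rgbBits & 0x2)
                          | ((rgbBits & 0x1) << 2));
}

// Build an SGR sequence that selects the foreground and background
// colors encoded in a 16-color console attribute word.
std::string attributeToSgr(std::uint16_t attr) {
    const unsigned fg = attr & 0x0F;
    const unsigned bg = (attr >> 4) & 0x0F;
    const int fgCode = ((fg & 0x8) ? 90 : 30) + toAnsiIndex(fg & 0x7);
    const int bgCode = ((bg & 0x8) ? 100 : 40) + toAnsiIndex(bg & 0x7);
    return "\x1b[" + std::to_string(fgCode) + ";" + std::to_string(bgCode) + "m";
}
```

Each run of text would then be written as the SGR sequence followed by the UTF-8 or UTF-16 text, with `"\x1b[0m"` at the end to reset the colors.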

lhecker avatar Apr 29 '25 20:04 lhecker

I'm used to working with VT support, so it isn't about that, only about seeing whether some very old stuff still works 😄 Yes, I also agree with closing, because there is no reason to break your great work for this API. Thanks.

BDisp avatar Apr 29 '25 20:04 BDisp