terminal Excise TranslateUnicodeToOem and everything it touched

The overarching intention of this PR is to improve our Unicode support. Most of our APIs still don't support anything beyond UCS-2 and DBCS sequences. This commit doesn't fix the UTF-16 support (by supporting surrogate pairs), but it does improve support for UTF-8 by allowing longer char sequences.

It does so by removing TranslateUnicodeToOem which seems to have had an almost viral effect on code quality wherever it was used. It made the assumption that all narrow glyphs encode to 1 char and most wide glyphs to 2 chars. It also didn't bother to check whether WideCharToMultiByte failed or returned a different amount of chars. So up until now it was easily possible to read uninitialized stack memory from conhost. Any code that used this function was forced to do the same "measurement" of narrow/wide glyphs, because of course it didn't had any way to indicate to the caller how much memory it needs to store the result. Instead all callers were forced to sorta replicate how it worked to calculate the required storage ahead of time. Unsurprisingly, none of the callers used the same algorithm...

Without it the code is much leaner and easier to understand now. The best example is COOKED_READ_DATA::_handlePostCharInputLoop which used to contain 3 blocks of almost identical code, but with ever so subtle differences. After reading the old code for hours I still don't get if they were relevant or not. It used to be 200 lines of code lacking any documentation and it's now 50 lines with descriptive function names. I hope this doesn't break anything, but to be honest I can't imagine anyone having relied on this unreliable cruft.

I needed some helpers to handle byte slices (gsl::span<char>), which is why a new til/bytes.h header was added. Initially I wrote a buf_writer class but felt like such a wrapper around a slice/span was annoying to use. As such I've opted for freestanding functions which take slices as mutable references and "advance" them (offset the start) whenever they're read from or written to. I'm not particularly happy with the design but they do the job.

Related to #8000 Fixes #4551 Fixes #7589

Validation Steps Performed

Unit and feature tests ✅
Far Manager ✅
Fixes test cases in #4551 and #7589 ✅

Jan 27 '23 16:01 lhecker

Also related to #4551?

Jan 27 '23 19:01 DHowett

Also fixes #7589.

Jan 31 '23 14:01 lhecker

Now after rewriting half of InputBuffer::Read I think that this is pretty solid. I've chosen not to implement support for surrogate pairs, because it was quite frankly a major pain in the 🍑 to stitch KeyEvents back together. Instead, it'd be much better to simply add support for surrogate pairs to KeyEvent itself (i.e. have it store up to 2 wchar_t), so that we don't have to do any stitching in the first place. Or if that's not possible, I'll add support for the stitching some time in the future. For this PR though it was just too much IMO.

Also:

Feb 03 '23 14:02 lhecker

Just wanted to check: With this PR merged, is the ReadFile w/ UTF-8 code page situation supposed to be improved?

Mar 30 '23 23:03 alexrp

Yes, the test case as described in #4551 is now fixed. However, I believe that "improved" is just the right choice of words: There's still no consistent internal support for surrogate pairs for instance, but this is something that I'm actively working on.

Mar 31 '23 00:03 lhecker

Has this fix shipped in Windows 11 conhost already?

May 21 '23 05:05 methane

Allegedly this will not ship for quite a while unfortunately. 10000s of lines of other changes are also waiting to be shipped, including massive correctness and performance improvements, and I'm quite excited to see when it does.

May 22 '23 11:05 lhecker

(It will ship in the Terminal in 1.18 Soon^TM though)

May 22 '23 11:05 zadjii-msft

Thank you!

May 22 '23 11:05 methane

Is this related to #12626?

Jul 27 '23 10:07 SainoNamkho

terminal terminal copied to clipboard

Excise TranslateUnicodeToOem and everything it touched

Validation Steps Performed

terminal
terminal copied to clipboard