terminal
terminal copied to clipboard
[Epic] Text Buffer rewrite
This is the issue tracking the great buffer rewrite of 202x.
Aims
- [x] Refactor to remove the need for
UnicodeStorage(which is a lookup table keyed on row+column)- Removing this allows us to remove
ROW::_id,ROW::_pParent,CharRow::_pParent
- Removing this allows us to remove
- [x] Reduce the fiddliness of the DBCS attribute APIs
- DBCS attributes are stored for every character when they could be easily inferred from column position
- [x] Add support for the storage of surrogate pairs
- Surrogate pairs work today as an accident of fate: a pair of UTF-16 code units encoding a EA=wide codepoint is seen as wide, which conveniently matches how many wchar_t it takes up.
- We have little to no proper support for a codepoint requiring two UTF-16 code units that is only seen as one column wide (#6555 (master issue), #6162 #8709)
- [x] Provide a platform on which to build full ZWJ support (#1472)
- [x] Kill
CharRow,CharRowCell,CharRowCellReference - [ ] Reduce the static storage required to store a row (eventually) by not storing space characters
- This should make MeasureRight faster, and therefore help fix #32.
Notes
Surrogate Pairs
Work will be required to teach WriteCharsLegacy to measure UTF-16 codepoints in aggregate, rather than individual code units.
I have done a small amount of work in WriteCharsLegacy. It is slow going.
Motivation
#8689 (IRM) requires us to be able to shift buffer contents rightward. I implemented it in a hacky way, but then realized that UnicodeStorage would need to be rekeyed.
Implementation
The buffer is currently stored as a vector (small_vector) of CharRowCell, each of which contains a DbcsAttribute and a wchar_t. Each cell takes 3 bytes (plus padding, if required.)
In the common case (all narrow text), this is terribly wasteful.
To better support codepoints requiring one or more code units representing a character, we are going to move to a single wchar string combined with a column count table. The column count table will be stored compressed by way of til::rle (#8741).
Simple case - all glyphs narrow
CHAR A B C D
UNITS 0041 0042 0043 0044
COLS 1 1 1 1
Simple case - all glyphs wide
CHAR カ タ カ ナ
UNITS 30ab 30bf 30ab 30ca
COLS 2 2 2 2
Surrogate pair case - glyphs narrow
CHAR 🕴 🕴 🕴
UNITS d83d dd74 d83d dd74 d83d dd74
COLS 1 0 1 0 1 0
Surrogate pair case - glyphs wide
CHAR 🥶 🥶 🥶
UNITS d83e dd76 d83e dd76 d83e dd76
COLS 2 0 2 0 2 0
Representative complicated case
CHAR 🥶 A B 🕴
UNITS d83e dd76 0041 0042 d83d dd74
COLS 2 0 1 1 1 0
Representative complicated case (huge character)
[FUTURE WORK]
CHAR ﷽
UNITS fdfd
COLS 12
Representative complicated case (Emoji with skin tone variation)
[FUTURE WORK]
CHAR 👍🏼
UNITS d83d dc31 200d d83d dc64
COLS 2 0 0 0 0
A column count of zero indicates a code unit that is a continuation of an existing glyph.
Since there is one column width for each code unit, it is trivial to match column offsets with character string indices by summation.
Work Log
- [x] Add tests for reflow so that we can rewrite it (#8715)
- [x] Hide more of CharRow/AttrRow's implementation details inside Row (#8446)
- [ ] (from Michael)
til::rle<T, S>- a run length encoded storage template, which we will use to store column counts
Other issues that might just be fixed by this
- [ ] #8839
- [x] #11756
- [x] #32
- [ ] #6987
- [ ] #30
Alright, I've claimed this issue for the text buffer updates.
Representative complicated case (huge character)
TAB characters in https://github.com/microsoft/terminal/issues/7810 could also be stored this way.
@KalleOlaviNiemitalo Indeed! I'd thought about this when designing the new advances-based buffer, but I was a little worried about the implications. We don't have good coverage for what happens when the cursor moves into a column covered by a wide glyph, and I think nailing that down will be important for getting tab-in-buffer working.
Actually, on second thought ... this isn't as bad as I'd expected. We already move into multi-cell glyphs, and we don't damage them properly when they're overstruck. We don't render the cursor with the correct width when we're inside such a glyph either...
When the new damage and column-to-region mapping algorithms, this may be a breeze.
Will this rewrite support a TextBuffer >= 32767? Currently ROW._id is a SHORT and TextBuffer size is a COORD, which is also composed of SHORTs.
Changing TextBuffer COORD references to til::point mean a major rewrite of half the code, but it's needed if you want to support Infinite Scrollback (with files).
@DHowett I don't think this is as simple as you think. Lets say you write out the string ABCDEFGHIJ, move the cursor back to column 3 (on the character C), and then output a TAB. This moves the cursor to column 9 (on the character I), but there's no visible change to the content. In what way does the buffer now change to reflect that TAB?
Things like a zero-width-joiner are even more problematic. Let's say you write out a waving white flag (U+1F3F3), a space, and a rainbow (U+1F308), then go back and replace that space with a zero-width-joiner. The buffer sequence is now flag-zwj-rainbow, which Unicode defines as a rainbow flag. How would you alter the display to reflect those 3 characters collapsing into 1?
And whatever you do has to be compatible with the way other terminal emulators behave, and more importantly, the way apps expect the terminals to behave, otherwise they just won't work correctly in Windows Terminal. I suspect some of the things you're envisioning here are simply not possible.
@j4james I'm not terribly concerned about tab - it's an auxiliary consideration for this spec. If we wanted to replicate iTerm's behavior here, it could only replace up to the next $TABSTOP characters with a N-column \t if the moved-over portion was otherwise stored as spaces.
ZWJ is another "future consideration". I'm hoping that this is more robust for now, rather than fully scalable for the future. N:M glyph mapping, inserting ZWJs, etc. is more impossible with what we have than this specification.
I wouldn't expect overstriking a non-joining space with a joining one to ever join adjacent columns once they're frozen in place in the text buffer. The behavior would, rather, be "ZWJ only attaches characters if they are written consecutively, and a cursor move breaks a character-in-composition" or "ZWJ is zero-width and always attaches to the character stored in the prior column." Either of those would be more possible than what we're working with today. It doesn't necessarily mean they're right, it just means that our horizon expands to better consider them.
All that, though, is not in scope right now. Incremental progress :smile: while retaining compatibility with what applications (using both the Win32 Console APIs and VT) do today are my goal.
@naikel breaking any dependency we have on COORD internally is a significant bonus.
@DHowett I tried once to extend TextBuffer >= 32767 changing ROW._id from SHORT to size_t and TextBuffer COORD to til::point and after three days failed miserably. I think it's easier to just code a whole new terminal from scratch.
EDIT: I really hope you can do this!
Thoughts for replacing the global function pointer that determines glyph width (which the Renderer uses to answer for ambiguous glyphs).
struct IMeasurer {
// Measures one codepoint, stored as UTF-16 code units
virtual CodepointWidth MeasureCodepoint(std::wstring_view) = 0;
// Measures multiple codepoints, stored as a string of UTF-16 code units.
// This function should (?) return 0-width code units for combining characters
// if and only if they would actually combine. Use Renzhi's new measuring algorithm here.
virtual std::vector<CodepointWidth> MeasureString(std::wstring_view) = 0;
}
struct UnicodeUCDMeasurer : public IMeasurer {
// implements it using raw compiled-in unicode UCD, never asks anyone else, totally static
}
struct RendererFallbackMeasurer : public IMeasurer {
IMeasurer* rootMeasurer;
RenderEngine* renderEngine;
// IFF rootMeasurer returns Ambiguous widths for any individual codepoints, ask the render engine
// This is only required for conhost, where we **must** maintain compatibility with "font dictates display width"
}
// This function should (?) return 0-width code units for combining characters // if and only if they would actually combine. Use Renzhi's new measuring algorithm here.
Can I ask what Renzhi's algorithm is? My concern is if it doesn't match the behaviour of the wcwidth routines that Linux apps often use, then those apps are just not going to work correctly in WT, no matter how brilliantly we render zero-width joiners.
Also how does this relate to what is stored in the buffer? Is the intention to measure before deciding how something gets stored, or is the measuring to determine how the buffer is rendered? The reason I ask, is because if element N in the row buffer doesn't map to column N on the display, then some VT operations will also not work correctly.
This can even be a problem for client-side operations, like search and selection, although that at least is something we could maybe compensate for. And I know the DX renderer already has issues with the buffer to screen mapping, but I'd rather it didn't get worse.
Renzhi is a MS employee like me who is moonlighting this project. He is an expert on font, Unicode and layout. He has been working with us internally to improve the overall Unicode support by providing an algorithm for general text width measurement.
获取 Outlook for iOShttps://aka.ms/o0ukef
OK, thanks for info. As long as you guys are confident that this is going work out OK, feel free to ignore my questions - I don't need to know all the details. The main thing was I wanted to make sure you weren't overlooking something in the design that might later turn out to be a problem.
since we said his name 3 times: @reli-msft
@j4james I can try to sum up his core suggestion. Sorry -- this will be terse and mostly incorrect.
Using only primitive knowledge of codepoints (no font info), we can approximate the column count and combining behavior of any string of text. Renzhi proposes an algorithm (to be submitted for public standard review at a later date, and consideration by other terminal emulators much in the same way as Egmont's BiDi spec) and a reference implementation (which applications could eventually use for measuring) and an optional font feature that a font author can use to enforce specific constraints (e.g. "the Terminal Width 1 calculator would measure this to be 2 columns; applying font feature tw01 will ensure that this does measure two columns when rendered"). The font stuff aside, though, the primitives are such that we'd store this in the buffer and maintain the invariants that preexisting applications (windows and VT) are expecting. :smile:
It also comes with some suggestions about an "editing zone"/"active zone", which dictates where the cursor is/how appending combining glyphs works/how destruction of multi-column glyphs works.
My primary intent is twofold:
- Refactor the buffer to make column counts more explicit, rather than imputing them based on "is the character double-width and does it happen to be actually doubled in the backing store and have we set the flags appropriately"
- Have somewhere to store additional info (like cluster counts, bidi paragraph info, etc.)
We're in the always-weird situation where wcwidth is insufficient for things that are more than one wchar in code unit length, and wcswidth doesn't appear to have been holistically designed for this. Hoping that by offering a library and a public spec that others could buy off on we can build some community energy around text layout.
With the current test implementation, element N in the Row buffer will not map to column N -- simply because we want to to support one code unit taking up two columns (U+30AB KATAKANA LETTER KA) and two code units taking up one column (U+1F574 MAN IN BUSINESS SUIT LEVITATING). Like the DirectWrite clustering algorithm, we need an aside table that lets us figure out what code unit range covers a specific column (or column range). That's transparent, though. The interface for Row exposes NewMagicGlyphAt (name TBD), which will return the entire codepoint (all code units) in a specific column and the original starting column & column extent in case the requested column was partway through a multi-column glyph.
More typing, more e-mails for everyone.
We can easily keep the invariant that Row::bufffer[N] maps to column N in the DBCS case for U+30AB (simply double or triple the character), but I cannot see how we will keep the invariant for U+1F574 (since it will always be two code units) unless we go to UTF-32, which will certainly double our memory usage in the common case.
We already don't totally have that mapping today... the UnicodeStorage is weird (it's a map keyed on row # + column #) and stores any codepoint that requires more than one code unit. I expect that it is very expensive in the uncommon case, and I have found to date three places were we don't actually manipulate it properly.
Storing everything inline with a running advance counter seems much more . . . civil?
@j4james I am expecting that, the measuring algorithm that I am working on, will return the same results as wcwidth if the string doesn't involve proper complex scripts (like Devanagari, or Emoji that used ZWJ).
However when dealing proper complex scripts, that purposed algorithm will take the context into consideration and return more accurate results for them. For example, the purposed measuring algorithm will be able to handle Halants (Viramas), which is frequently used in Indic scripts that (in many cases) turn consonant letters around it into combining forms.
I understand why you want to do this, but if your proposed algorithm is going to break existing software (which I assume will be the case), then all I ask is that we make it opt-in via a private mode sequence. That way, apps that rely on existing terminal behavior will still work, and those that want to make use of the new improved measuring algorithm can still enable it explicitly.
Also, if you haven't already done so, I'd also encourage you to read up on some of the previous proposals for addressing wcwidth limitations, and the problems that arise from trying to make something like this work in a terminal emulator.
if your proposed algorithm is going to break existing software (which I assume will be the case) (JH)
maintain the invariants that preexisting applications (windows and VT) are expecting (DH)
...will return the same results as
wcwidthif the string doesn't involve proper complex scripts... (RL)
I realize that this is caveated (third quote), but existing software that is aware of 1- or 2- column widths shouldn't see a change in behavior.
Legitimate question though- are there conceivably applications that would be relying on terminal emulators to poorly compose combining characters in such a way that they would be broken if composition were fixed? Is bad composition the right thing to keep? I suspect applications just avoid it...
I'd previously forwarded @reli-msft the discussions on the VT working group & gnome-terminal trackers about wcwidth and addressing its limitations. We're not working in a vacuum :smile:
Legitimate question though- are there conceivably applications that would be relying on terminal emulators to poorly compose combining characters in such a way that they would be broken if composition were fixed?
Any application that deals with generic text that it needs to wrap or clip to fit an area of the screen. Text editors, email clients, web browsers, irc/chat clients, multiplexers. If the text turns out to be longer than the app expected, that can potentially end up breaking the entire layout of the page.
Then there are things like cursor positioning, highlighting of search terms, and link text. If the app expects a certain word to appear at a particular offset on the line, but it's actually somewhere else because our measurements aren't in sync, then they're going to be highlighting the wrong text, or placing the cursor in the wrong position.
So yeah, it's nice if you can get your combining characters to render perfectly, but not if that breaks everything else.
@j4james
I am expecting that for LGC and CJK writing systems, most symbols and pictograms, alongside "simple" Emoji set, the purposed measurer (TMe1) will be compatible with wcwidth.
For the rest, like Devanagari or Telugu, their usage in CLI applications is very limited, and both terminals and CLI applications' current behavior around them is already severely broken. Therefore, it is not a big problem to create a better but non-compatible algorithm for these writing systems. There will also be a termcap for complex-script-awareness.
If strict wcwidth compatibility is really a strong need, the purposed complex script support architecture is also capable to introduce multiple measurers, so we could make either the old or new measurer opt-in.
I’d like to point out that the current TextBuffer implementation is also a huge blocker that stands in the way of high IO throughout & overall performance. The inability to batch read/write a large range of characters in the TextBuffer is disastrous in terms of IO throughput.
If the team decides to improve the ConPTY & WT performance and take it to the next level, this issue is likely the way to go.
Update: An experimental PR #10702 to demonstrate the performance benefit of text-buffer rewrite.
@DHowett Do you think this PowerShell issue is actually a Windows Terminal issue? Note the console width is odd (125). The unknown char glyphs don't show up when the console width is even.

We can repro this with WSL/Bash in Windows Terminal (no PowerShell). But we can't repro in other terms (iTerm) using PowerShell.
Thinking about this issue some more, I realized that almost all of our remaining problems can be expressed with just a single goal in mind: The IsGlyphFullWidth function needs to be removed.
Of course it has to remain in some places for the terminal to function (TextBuffer primarily), but in most places it needs to go. This is for two reasons:
- The vast majority of code uses
bool IsGlyphFullWidth(wchar_t)and not itsstd::wstring_viewvariant. In other words, they don't work with surrogate pairs, so introducing something as advanced as grapheme clusters isn't even remotely possible there. - The mere existence of a call to
IsGlyphFullWidthalone is often indication enough that the code is trying to measure the size of strings in columns, even though theTextBufferis supposed to be the sole central authority for this. It'd be untenable to introduce proper, complex column measurement in the 28 places that currently re-measure strings without considering theTextBuffer.
Additionally OutputCellIterator should be merged into TextBuffer and ROW and it's 3 calls to IsGlyphFullWidth be replaced, so that we can add grapheme cluster segmentation there, which can't be trivially represented using an iterator like OutputCellIterator.
This requires the following changes:
- [x]
Search::s_CreateNeedleFromStringHas seemingly never been refactored to properly use theTextBufferiterators. 📋 Solution: Requires a basic refactor to ask the text buffer for width information. - [x]
TextBuffer::GetPatternsIt usesIsGlyphFullWidthbecause the regexprefix()is awstringand doesn't map back toTextBufferrow/column coordinates at all. That information was lost when this function concatenates all rows into a single string. 📋 Solution: Don't concatenate all rows into a single string. C++ regexes accept any iterators as long as they dereference to characters. Just create a simple row-iterator for this function.prefix()will then return this custom iterator wrapper which can be used to get back the exact buffer locations that were skipped. - [x]
WriteCharsLegacyIt filters control characters out of the incoming string and writes them into a local buffer on the stack. While doing so it measures how many columns the string is wide. It flushes the buffer toTextBufferwhenever it thinks the line has wrapped. ThenTextBufferimplicitly remeasures the entire string again because no one recorded how many columns each individual character was wide, just the grand total. 📋 Solution: Just pass the entire text to theROWthe cursor is on and let it consume characters in bulk until it's full. Advance the cursor, then write into the nextROWand so on until all text has been consumed. That way onlyROWneeds to measure anything. - [x]
CommandListPopup::_drawListMeasures strings so that it doesn't draw text outside of the popup. 📋 Solution: introduce a column limit to the bulk text write API introduced forWriteCharsLegacy. - [x]
ConsoleImeInfo::s_ConvertToCellsHundreds of lines to measure the size of the given input string to ensure it doesn't wrap a line/row, just likeWriteCharsLegacy. 📋 Solution: Grab theROW. Write the text. Done. - [ ]
_ConvertCellsToMungedWAs perApiRoutines::WriteConsoleOutputWImpl: "For compatibility reasons, we must maintain the behavior that munges the data if we are writing while a raster font is enabled. This can be removed when raster font support is removed." ~~Solution:~~ Don't touch. - [x]
InputBuffer::_ReadBufferRequires adult supervision to understand. Probably incorrect code.IsGlyphFullWidthis only called if non-Unicode text handling is requested, which only happens throughDirectReadData::Notify. Supposedly it's there so that if a non-Unicode client requests 4 events, that it emits 2 wide characters as DBCS instead of 4. But just like withTranslateUnicodeToOemthis seems quite wrong (UAX#11 != OEM DBCS). 📋 Solution: MergeSplitToOemintoInputBuffer::_ReadBuffer. Convert each individual character to OEM right there and see if it results in a DBCS or not. Don't useIsGlyphFullWidthfor that. Most likely requires extensive refactoring, because this part of conhost is a "big ball of mud". - [x]
TranslateUnicodeToOemBewildering code. CallsWideCharToMultiBytefor every individual character. UsesIsGlyphFullWidthto decide whether to check for DBCS or not. One could argue that this is to ensure that narrow glyphs are translated to non-DBCS and wide glyphs to DBCS, but nope, it translates non-DBCS wide glyphs to non-DBCS. Additionally, the width of a character as defined by UAX#11 in 2022 has nothing to do with whether a character was a DBCS sequence in the 90s. 📋 Solution: CallWideCharToMultiByteonce for the entire UTF16 string. It's way faster, simpler and more correct.- [x]
COOKED_READ_DATA::_handlePostCharInputLoopHorrifying. Contains 3 almost identical copies of the sameIsGlyphFullWidthloop, but with subtle changes to it, just to figure out the right amount of memory to preallocate when callingTranslateUnicodeToOem. This leaks an implementation detail ofTranslateUnicodeToOem, forcing callers to know howTranslateUnicodeToOemworks internally and forcingTranslateUnicodeToOemto not break that API contract ever.TranslateUnicodeToOemcan be easily refactored to know the exact amount it'll need. 📋 Solution: Just letTranslateUnicodeToOemfigure out the right amount of memory. - [x]
RAW_READ_DATA::Notify_ReadPendingInput_ReadCharacterInputExact same thing asCOOKED_READ_DATA::_handlePostCharInputLoop: CallsIsGlyphFullWidthjust to figure out the right amount of memory to preallocate. 📋 Solution: Just letTranslateUnicodeToOemfigure out the right amount of memory.
- [x]
- [ ]
InputBuffer::_CoalesceRepeatedKeyPressEventsThis function updates the repeat count of the last input event if you keep holding a key. It does so only if the input key is not a wide character, which seems highly... inconsistent? Why are wide characters different from narrow ones in this regard? There's no documentation saying why this is necessary. Solution? Not sure what to do here... - [x]
CheckBisectStringW"This routine check bisected on Unicode string end." The meaning of "bisect" is not documented but it means "a character doesn't fit into the last cell of a row". ThecBytesargument is a false hering and is actually used as a column count. The method then returns true if any wide wide character in the string (= 2 columns) doesn't fit into the remainingcBytesbecause only 1 column is left. This handling is basically entirely redundant withWriteCharsLegacy. 📋 Solution: Every single use ofCheckBisectStringWincommandline.cppis either redundant withWriteCharsLegacyalready, or can be replaced easily with just some minimal helper functions added toTextBufferandROWto allow for retrieving the preceding/following valid cursor position. - [x]
CheckBisectProcessWDoes the same thing asCheckBisectStringW, but it "processes" control characters too, you see, hence the name. 📋 Solution: Same asCheckBisectStringW: Remove all usage with existing alternatives. - [x]
RetrieveTotalNumberOfSpacesDoes the exact same thing asCheckBisectStringWandCheckBisectProcessWandWriteCharsLegacyandCommandListPopup::_drawList, but with a twist! This one returns the number of columns a string is long. 📋 Solution: Same asCheckBisectStringW: Remove all usage with existing alternatives. - [x]
AdaptDispatch::FillRectangularAreaSimilar toCommandListPopup::_drawList: It only usesIsGlyphFullWidthbecause ofOutputCellIteratorand itsfillWidthparameter, which is not in columns but rather in characters. This isn't just confusing/weird, but also forces callers to measure the width of the fill character, just so that thefillWidthparameter is in columns as expected. 📋 Solution:rm -rf OutputCellIterator.
Since 2022 is close to an end, any timeline on this?
https://github.com/microsoft/terminal/issues/1947 (issue that I am facing)+ https://github.com/microsoft/terminal/issues/1860 is making it almost impossible to use when SSHing into lot of devices/VMs.
And this ticket seems to be a blocker for #1947's potential MR.
Hmm. We got rid of UnicodeStorage in #13626. IRM might be easier now, now that we don't need to re-key all that. We may want to revisit.
IRM might be easier now, now that we don't need to re-key all that. We may want to revisit.
Did we end up revisiting this?
@ofek Support for IRM was added in Windows Terminal Preview v1.17.1023.
why this task is marked as completed when the linked issue is open?