terminal icon indicating copy to clipboard operation
terminal copied to clipboard

[Epic] Text Buffer rewrite

Open DHowett opened this issue 5 years ago • 27 comments

This is the issue tracking the great buffer rewrite of 202x.

Aims

  • [x] Refactor to remove the need for UnicodeStorage (which is a lookup table keyed on row+column)
    • Removing this allows us to remove ROW::_id, ROW::_pParent, CharRow::_pParent
  • [x] Reduce the fiddliness of the DBCS attribute APIs
    • DBCS attributes are stored for every character when they could be easily inferred from column position
  • [x] Add support for the storage of surrogate pairs
    • Surrogate pairs work today as an accident of fate: a pair of UTF-16 code units encoding a EA=wide codepoint is seen as wide, which conveniently matches how many wchar_t it takes up.
    • We have little to no proper support for a codepoint requiring two UTF-16 code units that is only seen as one column wide (#6555 (master issue), #6162 #8709)
  • [x] Provide a platform on which to build full ZWJ support (#1472)
  • [x] Kill CharRow, CharRowCell, CharRowCellReference
  • [ ] Reduce the static storage required to store a row (eventually) by not storing space characters
    • This should make MeasureRight faster, and therefore help fix #32.

Notes

Surrogate Pairs

Work will be required to teach WriteCharsLegacy to measure UTF-16 codepoints in aggregate, rather than individual code units.

I have done a small amount of work in WriteCharsLegacy. It is slow going.

Motivation

#8689 (IRM) requires us to be able to shift buffer contents rightward. I implemented it in a hacky way, but then realized that UnicodeStorage would need to be rekeyed.

Implementation

The buffer is currently stored as a vector (small_vector) of CharRowCell, each of which contains a DbcsAttribute and a wchar_t. Each cell takes 3 bytes (plus padding, if required.)

In the common case (all narrow text), this is terribly wasteful.

To better support codepoints requiring one or more code units representing a character, we are going to move to a single wchar string combined with a column count table. The column count table will be stored compressed by way of til::rle (#8741).

Simple case - all glyphs narrow
 CHAR    A    B    C    D
UNITS 0041 0042 0043 0044
 COLS    1    1    1    1

Simple case - all glyphs wide
 CHAR   カ   タ   カ   ナ
UNITS 30ab 30bf 30ab 30ca
 COLS    2    2    2    2

Surrogate pair case - glyphs narrow
 CHAR         🕴        🕴        🕴
UNITS d83d dd74 d83d dd74 d83d dd74
 COLS    1    0    1    0    1    0

Surrogate pair case - glyphs wide
 CHAR        🥶        🥶        🥶
UNITS d83e dd76 d83e dd76 d83e dd76
 COLS    2    0    2    0    2    0

Representative complicated case
 CHAR        🥶    A    B         🕴
UNITS d83e dd76 0041 0042 d83d dd74
 COLS    2    0    1    1    1    0

Representative complicated case (huge character)
[FUTURE WORK]
 CHAR ﷽
UNITS         fdfd
 COLS           12

Representative complicated case (Emoji with skin tone variation)
[FUTURE WORK]
 CHAR 👍🏼
UNITS d83d dc31 200d d83d dc64
 COLS    2    0    0    0    0

A column count of zero indicates a code unit that is a continuation of an existing glyph.

Since there is one column width for each code unit, it is trivial to match column offsets with character string indices by summation.

Work Log

  • [x] Add tests for reflow so that we can rewrite it (#8715)
  • [x] Hide more of CharRow/AttrRow's implementation details inside Row (#8446)
  • [ ] (from Michael) til::rle<T, S> - a run length encoded storage template, which we will use to store column counts

Other issues that might just be fixed by this

  • [ ] #8839
  • [x] #11756
  • [x] #32
  • [ ] #6987
  • [ ] #30

DHowett avatar Oct 22 '20 00:10 DHowett

Alright, I've claimed this issue for the text buffer updates.

DHowett avatar Jan 12 '21 03:01 DHowett

Representative complicated case (huge character)

TAB characters in https://github.com/microsoft/terminal/issues/7810 could also be stored this way.

KalleOlaviNiemitalo avatar Jan 12 '21 05:01 KalleOlaviNiemitalo

@KalleOlaviNiemitalo Indeed! I'd thought about this when designing the new advances-based buffer, but I was a little worried about the implications. We don't have good coverage for what happens when the cursor moves into a column covered by a wide glyph, and I think nailing that down will be important for getting tab-in-buffer working.

DHowett avatar Jan 12 '21 21:01 DHowett

Actually, on second thought ... this isn't as bad as I'd expected. We already move into multi-cell glyphs, and we don't damage them properly when they're overstruck. We don't render the cursor with the correct width when we're inside such a glyph either...

When the new damage and column-to-region mapping algorithms, this may be a breeze.

DHowett avatar Jan 12 '21 21:01 DHowett

Will this rewrite support a TextBuffer >= 32767? Currently ROW._id is a SHORT and TextBuffer size is a COORD, which is also composed of SHORTs.

Changing TextBuffer COORD references to til::point mean a major rewrite of half the code, but it's needed if you want to support Infinite Scrollback (with files).

naikel avatar Jan 13 '21 00:01 naikel

@DHowett I don't think this is as simple as you think. Lets say you write out the string ABCDEFGHIJ, move the cursor back to column 3 (on the character C), and then output a TAB. This moves the cursor to column 9 (on the character I), but there's no visible change to the content. In what way does the buffer now change to reflect that TAB?

Things like a zero-width-joiner are even more problematic. Let's say you write out a waving white flag (U+1F3F3), a space, and a rainbow (U+1F308), then go back and replace that space with a zero-width-joiner. The buffer sequence is now flag-zwj-rainbow, which Unicode defines as a rainbow flag. How would you alter the display to reflect those 3 characters collapsing into 1?

And whatever you do has to be compatible with the way other terminal emulators behave, and more importantly, the way apps expect the terminals to behave, otherwise they just won't work correctly in Windows Terminal. I suspect some of the things you're envisioning here are simply not possible.

j4james avatar Jan 13 '21 00:01 j4james

@j4james I'm not terribly concerned about tab - it's an auxiliary consideration for this spec. If we wanted to replicate iTerm's behavior here, it could only replace up to the next $TABSTOP characters with a N-column \t if the moved-over portion was otherwise stored as spaces.

ZWJ is another "future consideration". I'm hoping that this is more robust for now, rather than fully scalable for the future. N:M glyph mapping, inserting ZWJs, etc. is more impossible with what we have than this specification.

I wouldn't expect overstriking a non-joining space with a joining one to ever join adjacent columns once they're frozen in place in the text buffer. The behavior would, rather, be "ZWJ only attaches characters if they are written consecutively, and a cursor move breaks a character-in-composition" or "ZWJ is zero-width and always attaches to the character stored in the prior column." Either of those would be more possible than what we're working with today. It doesn't necessarily mean they're right, it just means that our horizon expands to better consider them.

All that, though, is not in scope right now. Incremental progress :smile: while retaining compatibility with what applications (using both the Win32 Console APIs and VT) do today are my goal.

DHowett avatar Jan 13 '21 00:01 DHowett

@naikel breaking any dependency we have on COORD internally is a significant bonus.

DHowett avatar Jan 13 '21 00:01 DHowett

@DHowett I tried once to extend TextBuffer >= 32767 changing ROW._id from SHORT to size_t and TextBuffer COORD to til::point and after three days failed miserably. I think it's easier to just code a whole new terminal from scratch.

EDIT: I really hope you can do this!

naikel avatar Jan 13 '21 00:01 naikel

Thoughts for replacing the global function pointer that determines glyph width (which the Renderer uses to answer for ambiguous glyphs).

struct IMeasurer {
    // Measures one codepoint, stored as UTF-16 code units
    virtual CodepointWidth MeasureCodepoint(std::wstring_view) = 0;

    // Measures multiple codepoints, stored as a string of UTF-16 code units.
    // This function should (?) return 0-width code units for combining characters
    // if and only if they would actually combine. Use Renzhi's new measuring algorithm here.
	virtual std::vector<CodepointWidth> MeasureString(std::wstring_view) = 0;
}
struct UnicodeUCDMeasurer : public IMeasurer {
    // implements it using raw compiled-in unicode UCD, never asks anyone else, totally static
}
struct RendererFallbackMeasurer : public IMeasurer {
    IMeasurer* rootMeasurer;
    RenderEngine* renderEngine;

    // IFF rootMeasurer returns Ambiguous widths for any individual codepoints, ask the render engine
    // This is only required for conhost, where we **must** maintain compatibility with "font dictates display width"
}

DHowett avatar Jan 29 '21 04:01 DHowett

    // This function should (?) return 0-width code units for combining characters
    // if and only if they would actually combine. Use Renzhi's new measuring algorithm here.

Can I ask what Renzhi's algorithm is? My concern is if it doesn't match the behaviour of the wcwidth routines that Linux apps often use, then those apps are just not going to work correctly in WT, no matter how brilliantly we render zero-width joiners.

Also how does this relate to what is stored in the buffer? Is the intention to measure before deciding how something gets stored, or is the measuring to determine how the buffer is rendered? The reason I ask, is because if element N in the row buffer doesn't map to column N on the display, then some VT operations will also not work correctly.

This can even be a problem for client-side operations, like search and selection, although that at least is something we could maybe compensate for. And I know the DX renderer already has issues with the buffer to screen mapping, but I'd rather it didn't get worse.

j4james avatar Jan 29 '21 21:01 j4james

Renzhi is a MS employee like me who is moonlighting this project. He is an expert on font, Unicode and layout. He has been working with us internally to improve the overall Unicode support by providing an algorithm for general text width measurement.

获取 Outlook for iOShttps://aka.ms/o0ukef

skyline75489 avatar Jan 29 '21 23:01 skyline75489

OK, thanks for info. As long as you guys are confident that this is going work out OK, feel free to ignore my questions - I don't need to know all the details. The main thing was I wanted to make sure you weren't overlooking something in the design that might later turn out to be a problem.

j4james avatar Jan 29 '21 23:01 j4james

since we said his name 3 times: @reli-msft

zadjii-msft avatar Feb 01 '21 12:02 zadjii-msft

@j4james I can try to sum up his core suggestion. Sorry -- this will be terse and mostly incorrect.

Using only primitive knowledge of codepoints (no font info), we can approximate the column count and combining behavior of any string of text. Renzhi proposes an algorithm (to be submitted for public standard review at a later date, and consideration by other terminal emulators much in the same way as Egmont's BiDi spec) and a reference implementation (which applications could eventually use for measuring) and an optional font feature that a font author can use to enforce specific constraints (e.g. "the Terminal Width 1 calculator would measure this to be 2 columns; applying font feature tw01 will ensure that this does measure two columns when rendered"). The font stuff aside, though, the primitives are such that we'd store this in the buffer and maintain the invariants that preexisting applications (windows and VT) are expecting. :smile:

It also comes with some suggestions about an "editing zone"/"active zone", which dictates where the cursor is/how appending combining glyphs works/how destruction of multi-column glyphs works.

My primary intent is twofold:

  1. Refactor the buffer to make column counts more explicit, rather than imputing them based on "is the character double-width and does it happen to be actually doubled in the backing store and have we set the flags appropriately"
  2. Have somewhere to store additional info (like cluster counts, bidi paragraph info, etc.)

We're in the always-weird situation where wcwidth is insufficient for things that are more than one wchar in code unit length, and wcswidth doesn't appear to have been holistically designed for this. Hoping that by offering a library and a public spec that others could buy off on we can build some community energy around text layout.

DHowett avatar Feb 01 '21 20:02 DHowett

With the current test implementation, element N in the Row buffer will not map to column N -- simply because we want to to support one code unit taking up two columns (U+30AB KATAKANA LETTER KA) and two code units taking up one column (U+1F574 MAN IN BUSINESS SUIT LEVITATING). Like the DirectWrite clustering algorithm, we need an aside table that lets us figure out what code unit range covers a specific column (or column range). That's transparent, though. The interface for Row exposes NewMagicGlyphAt (name TBD), which will return the entire codepoint (all code units) in a specific column and the original starting column & column extent in case the requested column was partway through a multi-column glyph.

DHowett avatar Feb 01 '21 21:02 DHowett

More typing, more e-mails for everyone.

We can easily keep the invariant that Row::bufffer[N] maps to column N in the DBCS case for U+30AB (simply double or triple the character), but I cannot see how we will keep the invariant for U+1F574 (since it will always be two code units) unless we go to UTF-32, which will certainly double our memory usage in the common case.

We already don't totally have that mapping today... the UnicodeStorage is weird (it's a map keyed on row # + column #) and stores any codepoint that requires more than one code unit. I expect that it is very expensive in the uncommon case, and I have found to date three places were we don't actually manipulate it properly.

Storing everything inline with a running advance counter seems much more . . . civil?

DHowett avatar Feb 01 '21 21:02 DHowett

@j4james I am expecting that, the measuring algorithm that I am working on, will return the same results as wcwidth if the string doesn't involve proper complex scripts (like Devanagari, or Emoji that used ZWJ).

However when dealing proper complex scripts, that purposed algorithm will take the context into consideration and return more accurate results for them. For example, the purposed measuring algorithm will be able to handle Halants (Viramas), which is frequently used in Indic scripts that (in many cases) turn consonant letters around it into combining forms.

reli-msft avatar Feb 01 '21 21:02 reli-msft

I understand why you want to do this, but if your proposed algorithm is going to break existing software (which I assume will be the case), then all I ask is that we make it opt-in via a private mode sequence. That way, apps that rely on existing terminal behavior will still work, and those that want to make use of the new improved measuring algorithm can still enable it explicitly.

Also, if you haven't already done so, I'd also encourage you to read up on some of the previous proposals for addressing wcwidth limitations, and the problems that arise from trying to make something like this work in a terminal emulator.

j4james avatar Feb 01 '21 23:02 j4james

if your proposed algorithm is going to break existing software (which I assume will be the case) (JH)

maintain the invariants that preexisting applications (windows and VT) are expecting (DH)

...will return the same results as wcwidth if the string doesn't involve proper complex scripts... (RL)

I realize that this is caveated (third quote), but existing software that is aware of 1- or 2- column widths shouldn't see a change in behavior.

Legitimate question though- are there conceivably applications that would be relying on terminal emulators to poorly compose combining characters in such a way that they would be broken if composition were fixed? Is bad composition the right thing to keep? I suspect applications just avoid it...

I'd previously forwarded @reli-msft the discussions on the VT working group & gnome-terminal trackers about wcwidth and addressing its limitations. We're not working in a vacuum :smile:

DHowett avatar Feb 02 '21 00:02 DHowett

Legitimate question though- are there conceivably applications that would be relying on terminal emulators to poorly compose combining characters in such a way that they would be broken if composition were fixed?

Any application that deals with generic text that it needs to wrap or clip to fit an area of the screen. Text editors, email clients, web browsers, irc/chat clients, multiplexers. If the text turns out to be longer than the app expected, that can potentially end up breaking the entire layout of the page.

Then there are things like cursor positioning, highlighting of search terms, and link text. If the app expects a certain word to appear at a particular offset on the line, but it's actually somewhere else because our measurements aren't in sync, then they're going to be highlighting the wrong text, or placing the cursor in the wrong position.

So yeah, it's nice if you can get your combining characters to render perfectly, but not if that breaks everything else.

j4james avatar Feb 02 '21 00:02 j4james

@j4james

I am expecting that for LGC and CJK writing systems, most symbols and pictograms, alongside "simple" Emoji set, the purposed measurer (TMe1) will be compatible with wcwidth.

For the rest, like Devanagari or Telugu, their usage in CLI applications is very limited, and both terminals and CLI applications' current behavior around them is already severely broken. Therefore, it is not a big problem to create a better but non-compatible algorithm for these writing systems. There will also be a termcap for complex-script-awareness.

If strict wcwidth compatibility is really a strong need, the purposed complex script support architecture is also capable to introduce multiple measurers, so we could make either the old or new measurer opt-in.

reli-msft avatar Feb 02 '21 01:02 reli-msft

I’d like to point out that the current TextBuffer implementation is also a huge blocker that stands in the way of high IO throughout & overall performance. The inability to batch read/write a large range of characters in the TextBuffer is disastrous in terms of IO throughput.

If the team decides to improve the ConPTY & WT performance and take it to the next level, this issue is likely the way to go.

Update: An experimental PR #10702 to demonstrate the performance benefit of text-buffer rewrite.

skyline75489 avatar Jul 16 '21 11:07 skyline75489

@DHowett Do you think this PowerShell issue is actually a Windows Terminal issue? Note the console width is odd (125). The unknown char glyphs don't show up when the console width is even.

image

We can repro this with WSL/Bash in Windows Terminal (no PowerShell). But we can't repro in other terms (iTerm) using PowerShell.

rkeithhill avatar Sep 02 '21 05:09 rkeithhill

Thinking about this issue some more, I realized that almost all of our remaining problems can be expressed with just a single goal in mind: The IsGlyphFullWidth function needs to be removed.

Of course it has to remain in some places for the terminal to function (TextBuffer primarily), but in most places it needs to go. This is for two reasons:

  • The vast majority of code uses bool IsGlyphFullWidth(wchar_t) and not its std::wstring_view variant. In other words, they don't work with surrogate pairs, so introducing something as advanced as grapheme clusters isn't even remotely possible there.
  • The mere existence of a call to IsGlyphFullWidth alone is often indication enough that the code is trying to measure the size of strings in columns, even though the TextBuffer is supposed to be the sole central authority for this. It'd be untenable to introduce proper, complex column measurement in the 28 places that currently re-measure strings without considering the TextBuffer.

Additionally OutputCellIterator should be merged into TextBuffer and ROW and it's 3 calls to IsGlyphFullWidth be replaced, so that we can add grapheme cluster segmentation there, which can't be trivially represented using an iterator like OutputCellIterator.

This requires the following changes:

  • [x] Search::s_CreateNeedleFromString Has seemingly never been refactored to properly use the TextBuffer iterators. 📋 Solution: Requires a basic refactor to ask the text buffer for width information.
  • [x] TextBuffer::GetPatterns It uses IsGlyphFullWidth because the regex prefix() is a wstring and doesn't map back to TextBuffer row/column coordinates at all. That information was lost when this function concatenates all rows into a single string. 📋 Solution: Don't concatenate all rows into a single string. C++ regexes accept any iterators as long as they dereference to characters. Just create a simple row-iterator for this function. prefix() will then return this custom iterator wrapper which can be used to get back the exact buffer locations that were skipped.
  • [x] WriteCharsLegacy It filters control characters out of the incoming string and writes them into a local buffer on the stack. While doing so it measures how many columns the string is wide. It flushes the buffer to TextBuffer whenever it thinks the line has wrapped. Then TextBuffer implicitly remeasures the entire string again because no one recorded how many columns each individual character was wide, just the grand total. 📋 Solution: Just pass the entire text to the ROW the cursor is on and let it consume characters in bulk until it's full. Advance the cursor, then write into the next ROW and so on until all text has been consumed. That way only ROW needs to measure anything.
  • [x] CommandListPopup::_drawList Measures strings so that it doesn't draw text outside of the popup. 📋 Solution: introduce a column limit to the bulk text write API introduced for WriteCharsLegacy.
  • [x] ConsoleImeInfo::s_ConvertToCells Hundreds of lines to measure the size of the given input string to ensure it doesn't wrap a line/row, just like WriteCharsLegacy. 📋 Solution: Grab the ROW. Write the text. Done.
  • [ ] _ConvertCellsToMungedW As per ApiRoutines::WriteConsoleOutputWImpl: "For compatibility reasons, we must maintain the behavior that munges the data if we are writing while a raster font is enabled. This can be removed when raster font support is removed." ~~Solution:~~ Don't touch.
  • [x] InputBuffer::_ReadBuffer Requires adult supervision to understand. Probably incorrect code. IsGlyphFullWidth is only called if non-Unicode text handling is requested, which only happens through DirectReadData::Notify. Supposedly it's there so that if a non-Unicode client requests 4 events, that it emits 2 wide characters as DBCS instead of 4. But just like with TranslateUnicodeToOem this seems quite wrong (UAX#11 != OEM DBCS). 📋 Solution: Merge SplitToOem into InputBuffer::_ReadBuffer. Convert each individual character to OEM right there and see if it results in a DBCS or not. Don't use IsGlyphFullWidth for that. Most likely requires extensive refactoring, because this part of conhost is a "big ball of mud".
  • [x] TranslateUnicodeToOem Bewildering code. Calls WideCharToMultiByte for every individual character. Uses IsGlyphFullWidth to decide whether to check for DBCS or not. One could argue that this is to ensure that narrow glyphs are translated to non-DBCS and wide glyphs to DBCS, but nope, it translates non-DBCS wide glyphs to non-DBCS. Additionally, the width of a character as defined by UAX#11 in 2022 has nothing to do with whether a character was a DBCS sequence in the 90s. 📋 Solution: Call WideCharToMultiByte once for the entire UTF16 string. It's way faster, simpler and more correct.
    • [x] COOKED_READ_DATA::_handlePostCharInputLoop Horrifying. Contains 3 almost identical copies of the same IsGlyphFullWidth loop, but with subtle changes to it, just to figure out the right amount of memory to preallocate when calling TranslateUnicodeToOem. This leaks an implementation detail of TranslateUnicodeToOem, forcing callers to know how TranslateUnicodeToOem works internally and forcing TranslateUnicodeToOem to not break that API contract ever. TranslateUnicodeToOem can be easily refactored to know the exact amount it'll need. 📋 Solution: Just let TranslateUnicodeToOem figure out the right amount of memory.
    • [x] RAW_READ_DATA::Notify _ReadPendingInput _ReadCharacterInput Exact same thing as COOKED_READ_DATA::_handlePostCharInputLoop: Calls IsGlyphFullWidth just to figure out the right amount of memory to preallocate. 📋 Solution: Just let TranslateUnicodeToOem figure out the right amount of memory.
  • [ ] InputBuffer::_CoalesceRepeatedKeyPressEvents This function updates the repeat count of the last input event if you keep holding a key. It does so only if the input key is not a wide character, which seems highly... inconsistent? Why are wide characters different from narrow ones in this regard? There's no documentation saying why this is necessary. Solution? Not sure what to do here...
  • [x] CheckBisectStringW "This routine check bisected on Unicode string end." The meaning of "bisect" is not documented but it means "a character doesn't fit into the last cell of a row". The cBytes argument is a false hering and is actually used as a column count. The method then returns true if any wide wide character in the string (= 2 columns) doesn't fit into the remaining cBytes because only 1 column is left. This handling is basically entirely redundant with WriteCharsLegacy. 📋 Solution: Every single use of CheckBisectStringW in commandline.cpp is either redundant with WriteCharsLegacy already, or can be replaced easily with just some minimal helper functions added to TextBuffer and ROW to allow for retrieving the preceding/following valid cursor position.
  • [x] CheckBisectProcessW Does the same thing as CheckBisectStringW, but it "processes" control characters too, you see, hence the name. 📋 Solution: Same as CheckBisectStringW: Remove all usage with existing alternatives.
  • [x] RetrieveTotalNumberOfSpaces Does the exact same thing as CheckBisectStringW and CheckBisectProcessW and WriteCharsLegacy and CommandListPopup::_drawList, but with a twist! This one returns the number of columns a string is long. 📋 Solution: Same as CheckBisectStringW: Remove all usage with existing alternatives.
  • [x] AdaptDispatch::FillRectangularArea Similar to CommandListPopup::_drawList: It only uses IsGlyphFullWidth because of OutputCellIterator and its fillWidth parameter, which is not in columns but rather in characters. This isn't just confusing/weird, but also forces callers to measure the width of the fill character, just so that the fillWidth parameter is in columns as expected. 📋 Solution: rm -rf OutputCellIterator.

lhecker avatar Nov 23 '22 17:11 lhecker

Since 2022 is close to an end, any timeline on this?

https://github.com/microsoft/terminal/issues/1947 (issue that I am facing)+ https://github.com/microsoft/terminal/issues/1860 is making it almost impossible to use when SSHing into lot of devices/VMs.

And this ticket seems to be a blocker for #1947's potential MR.

Anutrix avatar Dec 11 '22 23:12 Anutrix

Hmm. We got rid of UnicodeStorage in #13626. IRM might be easier now, now that we don't need to re-key all that. We may want to revisit.

zadjii-msft avatar Dec 12 '22 01:12 zadjii-msft

IRM might be easier now, now that we don't need to re-key all that. We may want to revisit.

Did we end up revisiting this?

ofek avatar Mar 04 '23 22:03 ofek

@ofek Support for IRM was added in Windows Terminal Preview v1.17.1023.

j4james avatar Mar 04 '23 22:03 j4james

Screenshot 2023-05-02 161212 why this task is marked as completed when the linked issue is open?

mominshaikhdevs avatar May 02 '23 10:05 mominshaikhdevs