Rewrite ROW to be Unicode capable
This commit is a from-scratch rewrite of ROW with the primary goal to get
rid of the rather bodgy UnicodeStorage class and improve Unicode support.
Previously a 120x9001 terminal buffer would store a vector of 9001 ROWs
where each ROW stored exactly 120 wchar_t. Glyphs exceeding their
allocated space would be stored in the UnicodeStorage which was basically
a hashmap<Coordinate, String>. Iterating over the text in a ROW would
require us to check each glyph and fetch it from the map conditionally.
On newlines we'd have to invalidate all map entries that are now gone,
so for every invalidated ROW we'd iterate through all glyphs again and if
a single one was stored in UnicodeStorage, we'd then iterate through the
entire hashmap to remove all coordinates that were residing on that ROW.
All in all, this wasn't the most robust nor performant code.
The new implementation is simple (from a design perspective):
Store all text in a ROW in a regular string. Grow the string if needed.
The association between columns and text works by storing character offsets
in a column-wide array. This algorithm is <100 LOC and removes ~1000.
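The column-to-offset idea can be sketched roughly like this. This is a from-scratch illustration of the design described above, not the actual `ROW` code; `SketchRow` and all member names are made up:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <string_view>
#include <vector>

// Illustrative sketch only: one string holds all of the row's text, and
// _charOffsets has one entry per column (plus a sentinel) pointing at where
// that column's glyph begins in the string. The trailing column of a wide
// glyph stores the same offset as its leading column, so the full glyph can
// be recovered from any of its cells.
class SketchRow
{
public:
    explicit SketchRow(size_t width) : _charOffsets(width + 1, 0) {}

    // Append a glyph occupying `cols` columns, writing left to right.
    void WriteGlyph(size_t col, std::wstring_view glyph, size_t cols)
    {
        const auto beg = static_cast<uint16_t>(_chars.size());
        _chars.append(glyph);
        for (size_t i = 0; i < cols; ++i)
        {
            _charOffsets[col + i] = beg;
        }
        _charOffsets[col + cols] = static_cast<uint16_t>(_chars.size());
    }

    // Recover the full glyph from any of its columns.
    std::wstring_view GlyphAt(size_t col) const
    {
        const auto beg = _charOffsets[col];
        const size_t width = _charOffsets.size() - 1;
        size_t end = col + 1;
        while (end < width && _charOffsets[end] == beg)
        {
            ++end;
        }
        return std::wstring_view(_chars.data() + beg, static_cast<size_t>(_charOffsets[end] - beg));
    }

private:
    std::wstring _chars;                // all text in the row
    std::vector<uint16_t> _charOffsets; // one offset per column, plus sentinel
};
```

Iterating the row's text is then just walking one string, and a glyph that needs more storage is an ordinary string append instead of a side-table lookup.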
As an aside this PR does a few more things that go hand in hand:
- Remove most of the `ROW` helper classes, which aren't needed anymore.
- Allocate backing memory in a single `VirtualAlloc` call.
- Rewrite `IsCursorDoubleWidth` to use `DbcsAttrAt` directly.

Improves overall performance by 10-20% and makes this implementation faster than the previous NxM storage, despite the added complexity.
Part of #8000
Validation Steps Performed
- TBD
I love it.
I love how removing UnicodeStorage obviates the need for rows to have IDs, which removes the entire machine built around making sure row IDs remain sane
I haven't found it yet, but how do you handle the case where a double-cell character is inserted on a line where only a single-cell character could fit on the right side? We used to keep track of whether we had "forced" a wide glyph onto the next line, and we used that to reconstitute something for the API I believe...
That's still all there in ROW::WriteCells which still has the SetDoubleBytePadded(true) logic.
This PR doesn't improve Unicode support just yet - it only makes it possible to support it in the future once we get rid of `ROW::WriteCells`. Until then, in this PR, all the existing code around it was kept intact.
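For context, the behavior being referenced works roughly like this. This is a simplified sketch with made-up names, not the actual `ROW::WriteCells` code:

```cpp
#include <cassert>
#include <cstddef>

// Sketch: a wide (two-cell) glyph can't be split across lines. If it is
// written into the last remaining column, that cell is instead padded with
// whitespace, the row is flagged as double-byte padded, and the glyph wraps
// to the next line. The flag lets later reads through the console API
// reconstitute what actually happened.
struct WriteOutcome
{
    bool wrapToNextLine;
    bool doubleBytePadded;
};

WriteOutcome writeGlyph(size_t col, size_t rowWidth, size_t glyphCols)
{
    if (glyphCols == 2 && col == rowWidth - 1)
    {
        return { true, true }; // pad the last cell, mark the row, wrap the glyph
    }
    return { col + glyphCols >= rowWidth, false };
}
```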
@check-spelling-bot Report
:red_circle: Please review
See the :open_file_folder: files view or the :scroll: action log for details.
Unrecognized words (1)
Qxxxxxx
Previously acknowledged words that are now absent
AOn AStomps Childitem

To accept :heavy_check_mark: these unrecognized words as correct (and remove the previously acknowledged and now absent words), run the following commands
... in a clone of the git@github.com:microsoft/terminal.git repository
on the dev/lhecker/8000-text-buffer-rework branch (:information_source: how do I use this?):
```sh
update_files() {
perl -e '
my @expect_files=qw('".github/actions/spelling/expect/alphabet.txt
.github/actions/spelling/expect/expect.txt
.github/actions/spelling/expect/web.txt"');
@ARGV=@expect_files;
my @stale=qw('"$patch_remove"');
my $re=join "|", @stale;
my $suffix=".".time();
my $previous="";
sub maybe_unlink { unlink($_[0]) if $_[0]; }
while (<>) {
if ($ARGV ne $old_argv) { maybe_unlink($previous); $previous="$ARGV$suffix"; rename($ARGV, $previous); open(ARGV_OUT, ">$ARGV"); select(ARGV_OUT); $old_argv = $ARGV; }
next if /^(?:$re)(?:(?:\r|\n)*$| .*)/; print;
}; maybe_unlink($previous);'
perl -e '
my $new_expect_file=".github/actions/spelling/expect/64ca898ea5d0eddce7a18382613742f69a05af40.txt";
use File::Path qw(make_path);
use File::Basename qw(dirname);
make_path (dirname($new_expect_file));
open FILE, q{<}, $new_expect_file; chomp(my @words = <FILE>); close FILE;
my @add=qw('"$patch_add"');
my %items; @items{@words} = @words x (1); @items{@add} = @add x (1);
@words = sort {lc($a)."-".$a cmp lc($b)."-".$b} keys %items;
open FILE, q{>}, $new_expect_file; for my $word (@words) { print FILE "$word\n" if $word =~ /\w/; };
close FILE;
system("git", "add", $new_expect_file);
'
}
comment_json=$(mktemp)
curl -L -s -S \
-H "Content-Type: application/json" \
"https://api.github.com/repos/microsoft/terminal/issues/comments/1200446240" > "$comment_json"
comment_body=$(mktemp)
jq -r ".body // empty" "$comment_json" | tr -d "\\r" > $comment_body
rm $comment_json
patch_remove=$(perl -ne 'next unless s{^</summary>(.*)</details>$}{$1}; print' < "$comment_body")
patch_add=$(perl -e '$/=undef; $_=<>; if (m{Unrecognized words[^<]*</summary>\n*```\n*([^<]*)```\n*</details>$}m) { print "$1" } elsif (m{Unrecognized words[^<]*\n\n((?:\w.*\n)+)\n}m) { print "$1" };' < "$comment_body")
update_files
rm $comment_body
git add -u
```
Available dictionaries could cover words not in the dictionary
This includes both expected items (2856) from .github/actions/spelling/expect/alphabet.txt .github/actions/spelling/expect/expect.txt .github/actions/spelling/expect/web.txt and unrecognized words (1)
cspell:filetypes/filetypes.txt (337) covers 36 of them
cspell:django/django.txt (2342) covers 23 of them
cspell:html/html.txt (542) covers 22 of them
cspell:aws/aws.txt (1485) covers 20 of them
cspell:fullstack/fullstack.txt (181) covers 19 of them
Consider adding them using (in .github/workflows/spelling2.yml):
      with:
        extra_dictionaries:
          cspell:filetypes/filetypes.txt
          cspell:django/django.txt
          cspell:html/html.txt
          cspell:aws/aws.txt
          cspell:fullstack/fullstack.txt
To stop checking additional dictionaries, add:
      with:
        check_extra_dictionaries: ''
:pencil2: Contributor please read this
By default the command suggestion will generate a file named based on your commit. That's generally ok as long as you add the file to your commit. Someone can reorganize it later.
:warning: The command is written for posix shells. If it doesn't work for you, you can manually add (one word per line) / remove items to expect.txt and the excludes.txt files.
If the listed items are:
- ... misspelled, then please correct them instead of using the command.
- ... names, please add them to `.github/actions/spelling/allow/names.txt`.
- ... APIs, you can add them to a file in `.github/actions/spelling/allow/`.
- ... just things you're using, please add them to an appropriate file in `.github/actions/spelling/expect/`.
- ... tokens you only need in one place and shouldn't generally be used, you can add an item in an appropriate file in `.github/actions/spelling/patterns/`.
See the README.md in each directory for more information.
:microscope: You can test your commits without appending to a PR by creating a new branch with that extra change and pushing it to your fork. The check-spelling action will run in response to your push -- it doesn't require an open pull request. By using such a branch, you can limit the number of typos your peers see you make. :wink:
:clamp: If the flagged items are false positives
If items relate to a ...
- binary file (or some other file you wouldn't want to check at all).
  Please add a file path to the `excludes.txt` file matching the containing file.
  File paths are Perl 5 Regular Expressions - you can test yours before committing to verify it will match your files.
  `^` refers to the file's path from the root of the repository, so `^README\.md$` would exclude README.md (on whichever branch you're using).
- well-formed pattern.
  If you can write a pattern that would match it, try adding it to the `patterns.txt` file.
  Patterns are Perl 5 Regular Expressions - you can test yours before committing to verify it will match your lines.
  Note that patterns can't match multiline strings.
I love how removing UnicodeStorage obviates the need for rows to have IDs, which removes the entire machine built around making sure row IDs remain sane
I think it's worth mentioning that I was using the row IDs to track image segments in my sixel implementation. Not suggesting that's a good reason to keep them, since I'm not likely to do anything with that branch any time soon, but I do think there is value in having some mechanism via which additional metadata can be attached a row.
Another potential use case would be the scrollbar marks. I think we're currently tracking them with just a row number, but long term I'm hoping we'll change that, because I suspect the current approach will break if we ever get margins working in the terminal. That's perhaps a bit of an edge case, so not critical, but still.
[...] but I do think there is value in having some mechanism via which additional metadata can be attached a row.
Why not store or reference that data in ROW itself?
Can sixels be treated as (immutable) bitmaps? If so I can imagine it might be fairly performant and simple to store a vector of shared_ptr to the sixels/bitmaps on each row that they touch. We should then still be able to infer the exact pixels that need to be rendered on each row given the height stored in the referenced sixel/bitmap.
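That suggestion could look something like the following. This is purely a hypothetical sketch; `Bitmap`, `RowImages`, and everything else here are invented names, not terminal code:

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical per-row image references: every row an image touches shares
// ownership of one immutable bitmap, while rows that never saw an image pay
// only for an empty vector. Per-row rendering can slice the pixels it needs
// from the referenced bitmap's width/height.
struct Bitmap
{
    int width;
    int height;
    std::vector<uint32_t> pixels; // RGBA, row-major
};

struct RowImages
{
    std::vector<std::shared_ptr<const Bitmap>> images;
};
```

A bitmap spanning two rows would then be referenced from both; scrolling one row out of the buffer drops one reference, and the pixels are freed only once no row needs them.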
Why not store or reference that data in `ROW` itself?
In the case of sixel, 99% of the time you're not likely to have any data attached to the row - most users will probably never use sixel - so I didn't want to introduce an additional field that would just be wasting space most of the time. And since the ID already existed, that provided an easy way to add sixel functionality without impacting anyone that wasn't using it.
Having some kind of pointer on the row itself would certainly be easier, but it does mean more overhead. And what happens when we have some other bit of metadata that's also rarely used, and also needs to be associated with a row? Do we add yet another field? Maybe some kind of linked list of extension points? I don't know. I just wanted to point out that this use case exists, and row IDs were quite a compact way of addressing that need.
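The row-ID approach being described is essentially a sparse side table. A minimal sketch with made-up names, assuming stable row IDs as in the pre-rewrite code:

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Sketch: rarely-used metadata lives in a map keyed by a stable row ID
// instead of in a field on every ROW, so the vast majority of rows that
// carry no metadata cost nothing beyond the ID they already have.
using RowId = uint32_t;

struct SixelFragment
{
    std::vector<uint32_t> pixels; // the pixels this row contributes
};

std::unordered_map<RowId, SixelFragment> g_sixelByRow; // only populated rows appear
```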
Can sixels be treated as (immutable) bitmaps?
Not really. Sixel isn't so much an image format - it's more of a drawing protocol. So you need to think of it in terms of plotting arbitrary pixels at a point on the screen. The way I managed that was by creating a bitmap associated with each row that was touched by sixel output. Those bitmaps obviously need to be mutable so they can be updated when additional pixels are output.
But any kind of reference in the row would be fine for this - I was just looking for a way to minimize the overhead involved, and potentially share it with other row extensions.
Anyway, this isn't anything that has to be resolved now. I just wanted to the raise the issue as something to consider for the future.
Thanks for the feedback! I don't disagree with you and I'd be absolutely open to adding IDs back, even if for no reason at all. :smile:
But unrelated to this PR and any designs you have in mind (which I'd approve regardless), I'd like to mention an opposing view to the overhead of unused fields, just because...
Right now on x64/ARM64 a ROW is 584 bytes and this PR will shrink it down to just 84 bytes (with the option to shrink it further down to just 48 bytes). Even if we add a dozen pointers or so to ROW we'd still have way less overhead than before. Because we have many more potential areas where we can get drastic improvements like that, I strongly feel like we shouldn't worry at all about memory usage of a few unused fields and rather pick whatever is the easiest for us to write and the CPU to process. :smile:
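The arithmetic here holds up: a dozen extra pointers on x64 is 96 bytes, which added to the new 84-byte ROW is still far below the old 584 bytes. If struct size ever becomes a concern again, a `static_assert` is a cheap guard against silent regressions. The struct below is an illustrative stand-in, not the real ROW, and the 84-byte budget is just the figure quoted above:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative stand-in for a size-sensitive row header.
struct SketchRowHeader
{
    wchar_t* chars;        // backing text
    uint16_t* charOffsets; // column -> character offset
    uint16_t columnCount;
    bool wrapForced;
    bool doubleBytePadded;
};

// Fails to compile if someone grows the struct past the budget.
static_assert(sizeof(SketchRowHeader) <= 84, "row header grew unexpectedly");
```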
Any updates on this?
It will be valuable to test Windows vim.exe and FAR Manager, both of which read parts of the buffer and then write them back. FAR uses it for dialogs and menus -- make sure that slicing DBCS chars works right.
It will be valuable to test Windows `vim.exe` and FAR Manager, both of which read parts of the buffer and then write them back. FAR uses it for dialogs and menus -- make sure that slicing DBCS chars works right.
I found that very hard to test which is why I ended up writing the `DbcsTests::TestDbcsBackupRestore()` feature test. I feel like it covers this scenario way better than any manual testing could. :smile:
I feel like it covers this scenario way better than any manual testing could. :smile:
Yes and no! I agree. But also, synthetic testing only tests things that we are aware of. :smile:
I wonder if @alabuzhev knows about any specific cases that are at risk for which we can add explicit synthetic testing.
https://user-images.githubusercontent.com/2256941/200969501-58f3153e-0624-483f-9fd7-1d6a23f98c64.mp4
I wonder if @alabuzhev knows about any specific cases that are at risk for which we can add explicit synthetic testing.
Could you please briefly summarize what this is about? It's quite a long thread. Do I understand correctly that `ReadConsoleOutputW` should work better now (#10810)?
Could you please briefly summarize what this is about? It's quite a long thread. Do I understand correctly that `ReadConsoleOutputW` should work better now (#10810)?
No, that doesn't change it, unfortunately. What this PR changes is how text is stored, ensuring that a `COMMON_LVB_LEADING_BYTE` is always followed by a `COMMON_LVB_TRAILING_BYTE` no matter what. If that can't be done, the cells will be filled with whitespace instead. But this poses a problem of course:
If you make a snapshot with ReadConsoleOutputW and then restore it with WriteConsoleOutputW, we'll have to be careful that any partially captured leading/trailing characters are restored as wide glyphs. Apart from that, this PR is a fundamental rewrite of our text storage. Here's how I tested it with Far:
https://user-images.githubusercontent.com/2256941/200969501-58f3153e-0624-483f-9fd7-1d6a23f98c64.mp4
Is there any other area you can think of we should test? Or would you say that this already sufficiently shows that it works correctly?
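The leading/trailing invariant described above can be stated as a small check. This sketch redefines the two attribute flags with their real values from wincon.h so it is self-contained, and uses a minimal `Cell` as a stand-in for `CHAR_INFO`; `rowIsWellFormed` is an invented name, not terminal code:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Attribute flags as defined in wincon.h.
constexpr uint16_t COMMON_LVB_LEADING_BYTE = 0x0100;
constexpr uint16_t COMMON_LVB_TRAILING_BYTE = 0x0200;

struct Cell
{
    wchar_t ch;
    uint16_t attr;
};

// The invariant: a leading-byte cell is always followed immediately by a
// trailing-byte cell, so a wide glyph is never torn in half. Where that
// can't hold, the cells are filled with whitespace instead.
bool rowIsWellFormed(const std::vector<Cell>& row)
{
    for (size_t i = 0; i < row.size(); ++i)
    {
        if (row[i].attr & COMMON_LVB_LEADING_BYTE)
        {
            if (i + 1 >= row.size() || !(row[i + 1].attr & COMMON_LVB_TRAILING_BYTE))
            {
                return false;
            }
            ++i; // skip the trailing half
        }
    }
    return true;
}
```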
Thanks for the explanation.
If you make a snapshot with `ReadConsoleOutputW` and then restore it with `WriteConsoleOutputW`, we'll have to be careful that any partially captured leading/trailing characters are restored as wide glyphs.
I think there are no specific cases that are at risk here: currently we always capture the whole viewport and it happens quite rarely, e.g. at startup and after launching something (flowchart). In other words, there is exactly one read in your video - right after `type test.txt`. All the subsequent fiddling with panels works with that in-memory snapshot without re-reading anything, and all the "restored" characters are from that initial full read, so the fact that it looks good probably proves nothing in the context of this PR. Sorry to disappoint you.
Hello @lhecker!
Because this pull request has the AutoMerge label, I will be glad to assist with helping to merge this pull request once all check-in policies pass.
p.s. you can customize the way I help with merging this pull request, such as holding this pull request until a specific person approves. Simply @mention me (@msftbot) and give me an instruction to get started! Learn more here.
Removing "AutoMerge" tag. At sync yesterday, we chatted about having this PR need 3 signoffs since it touches pretty fundamental code.


:tada: Windows Terminal Preview v1.17.1023 has been released which incorporates this pull request. :tada:
Handy links: