Terminal.Gui
Terminal.Gui copied to clipboard
Improper parsing in TextFormatter.FindHotKey if hotkey specifier followed by unicode replacement character(s)
While doing a cleanup and performance pass on TextFormatter, the loops in the FindHotKey method made go hmmm.
They're basically Rune-ified IndexOf operations, looking for the hotkey specifier Rune and then the following Rune as the hotkey, so long as it isn't FFFD.
But the state machine doesn't account for the character following the hotkey specifier being FFFD.
When that happens, the state machine keeps running until it finds the next non-specifier and non-FFFD Rune and calls that the hotkey. That's of course not expected, but what makes it even more dangerous is that the index provided in the hotPos out parameter will be the index of the specifier, as with normal cases, but the following index is not the hotkey that it chose.
The naive fix is simple: Just have it reset curHotPos to -1 each time it finds Rune.ReplacementChar.
I've added several test cases that cover various sequences containing the hotkey specifier in at least one position and FFFD in at least one position.
I also see some simple ways to squeeze almost free performance out of that method, so I'm doing that as well (this was in the middle of performance work, after all). Mostly some span stuff and avoiding as many loop iterations as I can figure out without spending too much time on just that method.
So...
As I'm analyzing this, I'm noticing there are some invalid test cases (which I think was probably already known, based on comments around those tests).
In particular, the cases that include codepoint 0xE0B0 are incorrect.
Those shouldn't be returning as symbols, because that codepoint is in a private use area. If you feed that value to Rune.IsSymbol(thatRune), you'll get false, and if you feed it to Rune.GetUnicodeCategory(thatRune), you'll get UnicodeCategory.PrivateUse.
Those test cases should expect a false result, and we should probably have tests that include codepoints from each category, as well as combinations, both for scalars and for surrogate pairs, to ensure these methods are actually behaving properly. I'll at least be sure to fix the cases or the code, as appropriate. For that situation, it's both code and expected test result that need fixing.
I have what appears, so far, to be at least a more "correct" implementation (MUCH faster, too, for the majority of likely typical cases). I'll describe it in a follow-up comment here.
The new implementation of the FindHotKey method that I have is partially complete, in that it works seemingly correctly for BMP values, but still needs me to make the fixes to non-BMP cases.
The only loop that has been addressed, so far, is the first one in the method.
The first thing it does is checks if it can simply do an ordinal char scan of the string or if it has to resort to using the Rune enumerator, by checking if the hotkey specifier Rune is in the BMP. If it is, we can do an ordinal scan for that character, even if the string also contains surrogate pairs (that particular point is made in the Microsoft Learn Supplementary API Remarks docs for working with strings and makes sense based on how it's all defined anyway.
Anyway...
- If specifier is in the BMP, proceed to 1a, performing the loop beginning at index 0 of the string and ending one character before the end of the string (since it has to be a pair of chars to be valid):
1a. If at end of the loop, proceed with 2. Otherwise, find IndexOf the specifier, beginning at index, and set index to the resulting value. Proceed to 1b.
1b. If index was < 0 (not found) or we would go out of bounds, proceed to 1b1. Else proceed to 1c.
1b1. Set curHotPos to index. Terminate loop and proceed with 2.
1c. Get the rune that starts at index + 1 as
potentialHotKeyRune. Proceed to 1d. 1d. If all of the following are true ofpotentialHotKeyRune, proceed to 1e. Else increment index and proceed from 1a.- Is not FFFD
- Is not same as specifier
- Is in the BMP
- Is a letter, digit, symbol, or punctuation
1e. Set
curHotKeyto that rune andcurHotPostoindex. Terminate loop and proceed with 2.
- If either of the following are true, perform the currently existing full unicode loop. Otherwise, skip that and just proceed with the rest of the method, as we've done what needs to be done up to that point:
- specifier is not in the BMP
curHotPos > 0and the rune that starts atcurHotPosis not in the BMP (meaning either the previous loop didn't run (specifier not in BMP) or it did run but found a non-BMP rune after the specifier and bailed to enable fallback to this method).
That last step, for handling cases of either non-BMP specifier or when the specifier is BMP but the character following it is not, is currently still the original implementation, which I need to fix as described earlier. Just haven't gotten to that point yet.
When the first loop can run, this is dozens of times faster, in debug builds, and almost 400x faster in release builds, even before addressing the other loops that may need to run.
Ugh. Sorry the formatting didn't come out like I wanted with that. Some things didn't get indented which should have and some got indented which shouldn't have.
Here is the code for the first loop, which is the main new part. Note most of the ifs are inverted to avoid nesting:
ReadOnlySpan<char> stringSpan = text.AsSpan ();
if (hotKeySpecifier.IsBmp)
{
for(int index = 0;index < stringSpan.Length; index++)
{
index = text.IndexOf ((char)hotKeySpecifier.Value, index);
if (index < 0 || index + 1 >= text.Length)
{
curHotPos = index;
break;
}
if (!Rune.TryGetRuneAt (text, index + 1, out Rune potentialHotKeyRune))
{
continue;
}
if (potentialHotKeyRune == Rune.ReplacementChar
|| potentialHotKeyRune == hotKeySpecifier
|| !potentialHotKeyRune.IsBmp
|| (!Rune.IsLetterOrDigit (potentialHotKeyRune)
&& !Rune.IsPunctuation (potentialHotKeyRune)
&& !Rune.IsSymbol (potentialHotKeyRune)))
{
continue;
}
curHotKey = Rune.GetRuneAt (text, index + 1);
curHotPos = index;
break;
}
}
And here is the slightly modified second part, which is the fallback to full unicode, with the index reset fix applied:
if (!hotKeySpecifier.IsBmp || (curHotPos >= 0 && !Rune.GetRuneAt (text,curHotPos).IsBmp))
{
int index = 0;
foreach (Rune c in stringSpan.EnumerateRunes ())
{
if (c == Rune.ReplacementChar)
{
curHotPos = -1;
}
else if (c == hotKeySpecifier)
{
curHotPos = index;
}
else if (curHotPos > -1)
{
curHotKey = c;
break;
}
index++;
}
}
I'm not done with that, yet, so it may still have issues.
Note that using either or both of these with the current tests will fail on a few of them, so it's not worth bothering with that unless you're just curious. This is just so y'all can see what I'm thinking with this in case any of it raises any red flags.
I'm totally going to benchmark this against the TextElementEnumerator, too, now that we have a cross-platform guarantee of consistent and correct behavior with it (which we didn't have when supporting .net Framework or .net < 5.0, making it pretty much useless before).
I would almost put money on it outperforming the Rune implementation and being more correct, implicitly, since it handles all code points, including surrogates, in multiple encodings, out of the box. And with the way there was SIMD love given to a lot of text-related stuff in .net 8, I even bet it might give my little BMP semi-optimization a run for its money, too.
Also, about TextElementEnumerator and the StringInfo class...
It uses ICU, now, which means it uses the current "gold standard" unicode implementation, and will evolve with Unicode, unlike certain methods and properties of char and Rune which may be incorrect, incomplete, non-standard, or not agree with each other, even for the same codepoints (yes, even still in .net 8).
That sounds interesting and I'm curious.
TL;DR.
However, i scanned it and <3.
That said, it's dorky that HotKeySpecifier is a Rune. The only valid values are those typable on a keyboard with a single key press.
IMO, HotKeySpecifier should be Key.
At the time I remember that I warned about this and made some changes in v1 to not use FFFD. I'm glad you're moving to Key. Thanks.
TL;DR.
However, i scanned it and <3.
That said, it's dorky that
HotKeySpecifieris a Rune. The only valid values are those typable on a keyboard with a single key press.IMO,
HotKeySpecifiershould beKey.
Agreed.
In that case, the whole thing can just operate on a Span<char>.
Could be a tangible speedup on first load of something, especially if there are a ton of views.
Ok, so...
With that restriction, what constitutes a match only has to satisfy the following conditions:
- Must be specifier character immediately followed by a valid hotkey
- Valid specifiers and hotkeys are BMP, not in the PUA, and must be in any of the following unicode categories:
- Letter
- Digit
- Punctuation
- Symbol
One behavior I'm torn on is the current code's treatment of the hotkey specifier in the string, at all, especially if there are multiple specifier characters in the string. Right now, the behavior is to find the first specifier followed by a different character than the specifier, and then ignore the rest for hotkey purposes, but still remove them for display purposes.
A couple of issues with that:
- The code to find the width of the text subtracts the hotkey specifier width if it is found at all in the string, unconditionally, even if it didn't actually become a hotkey because it didn't pass the above conditions.
- That's not intuitive.
- Why not let the consumer use the same character for the specifier and hotkey?
- This is potentially especially troublesome if the text is generated programmatically or from some data source, and that text has the hotkey specifier character in it.
So, I'd like to change the behavior to allow that character to be displayed, where it isn't in the actually selected hotkey position. But I'd leave the selection behavior the same (first one found is it and the search stops there).
All that really changes is that strings such as "__Text", which would show up as "_Text" right now and have a hotkey of T would still be displayed the same way (though the 0th character was deleted, not the 1st) and the hotkey would be underscore.
Also
I think the behavior should be modified so that, if the HotKey property is already set (ie not KeyCode.Null), none of this even runs, so that explicitly setting the hotkey always wins.
All that really changes is that strings such as
"__Text", which would show up as"_Text"right now and have a hotkey of T would still be displayed the same way (though the 0th character was deleted, not the 1st) and the hotkey would be underscore.
I concours with the displaying of the specifier if more than once is provided because that it's the intention of the user.
Also
I think the behavior should be modified so that, if the HotKey property is already set (ie not KeyCode.Null), none of this even runs, so that explicitly setting the hotkey always wins.
I concur.
Coolio.
All that stuff should be pretty easy to do, then.
As I mentioned, I had started that branch off intending to just do opportunistic performance stuff for TextFormatter.
But, I think I'm going to separate at least some of it to a separate branch/PR just to contain scope a little bit. The thing is that I can't totally separate it all without intentionally doing superfluous work to make clean intermediate stages. So I guess we'll see what I have when I'm finished. 😅
One thing is we repeat a bunch of operations quite a bit.
So I'm trying to unwind some of that in TextFormatter and avoid at least some of that duplication of effort.
One that has the potential to be particularly heinous in a few ways is that most property setters in that class unconditionally set NeedsFormat, even if no value change occurred.
That of course can lead to unnecessary re-draws, which is a potentially large performance hit, especially with some of the work that whole process involves.
There's also a ton of implicit memcopy work in there, and the EnableNeedsFormat method is a non-trivial source of it, since it ends up mostly handling structs, both passed and returned by value, and then assigned to a field. I count at least 3, but possibly as many as 6 copies between the public API surface and NeedsFormat's backing field, just from following a few possible code paths. That's easily fixed though, by reference passing, so no big deal any more there.
But what worries me more is that the class is very not thread-safe. And it even has comments that claim it is, probably from old code. But that code isn't there and, in fact, there's literally no synchronization that has any effect anywhere in that class.
As far as NeedsFormat goes... While getting or setting a boolean is an atomic operation, in .net-land, the big problem (which isn't performance-related) is a race condition that can prevent necessary re-draws, if a property is set (causing NeedsFormat to get flipped to true) at any time before the re-draw completes and sets it false again.
And especially since consumer code can extend stuff or inject behaviors in various ways, the impact and result of that race condition is unbounded and unpredictable.
I'm going to end up at least implementing some basic synchronization in that class in a few places, because there are some serious problems. And some of those are things I've mentioned before but haven't gotten around to working on, yet, soooo... I guess now here it goes!
While I usually am fine with doing a little bit of extra work to make an easy-to-grok change progression across commits, this stuff is going to be beyond that threshold in several spots. So, There's a good chance I'll just end up re-titling this issue to cover what it ends up as. 😅
As a heads-up, probably a good idea to minimize or avoid entirely any significant modifications to TextFormatter.cs, if that doesn't hinder work items you've got going too much. If that's too much of a pain, then don't worry about it. I'll figure out how to make it all play nicely when it's time for a PR. 🤷♂️
Aaaannnnnnyway... Back to it!
If anyone has any wishlist items for TextFormatter, now is a great time to speak up!
Actually... At least to avoid one source of potential conflicts...
I'm going to cherry pick a few commits from what I've done so far that aren't breaking changes but are structurally quite different right now, and then rebase on that and go from there (dropping the cherry-picked commits from the rebase, of course).
That way, if anyone does need to make changes to those files, they'll at least be much easier to follow.
There we go.
#3282
That PR is a pretty minimal set of changes and some are also things that I myself have done multiple times but that somehow kept not making it into my PRs. 😅
Renamed this to reflect the expanded scope better.
#3287 is relevant to this, but posted as a separate issue for visibility, since it's a pretty core type.
That's being dealt with in this work.
Hey @tig @BDisp
TextFormatter.FindHotKey has a bool parameter to enable old behavior of using first capital letter instead of the character following a hotkey specifier.
Do we want that or can it be removed?
I vote remove for many reasons, not the least of which is that this is a pretty small thing compared to many other breaking changes. 🤷♂️
If it's left in, we do have to have this method support unicode, as well as anything that supports hotkeys, or else codepoints outside BMP that are classified as capitals won't be handled properly.
About that...
As it turns out, nothing internal to the library actually even uses it explicitly except for RadioGroup, which passes true unconditionally, which looks like old behavior that's almost certainly not intended, now, as far as I can tell.
Also for what it's worth, that change only required removing that code path from FindHotkey itself, removing the two test methods explicitly for that behavior, and removing the parameter from the calls in RadioGroup to compile and pass all tests, so it really wasn't being used or respected anywhere else, anyway.
But I did that in a specific commit. So, if y'all still want it, I can just revert that commit to bring it back.
I think @tig already decided to force specifying the hot key specifier for all situations. I don't know why RadioGroup wasn't considered too.
Cool. I already went forward with that assumption anyway.
Right now I'm trying to avoid as many enumerations of the string as possible within the scope of that method, without requiring other changes, and am benchmarking a few of the options available that can handle the fact that the string might be unicode, but that we only want values under U+D7FF for both the specifier and the hotkey.
In earlier discussion, we were thinking the use of Rune wasn't necessary, here. Unfortunately, unless we also impose the <= U+D7FF restriction on the actual text, we still have to care because the position will be different if any other characters before it are beyond the BMP. So, even if the parameter is changed to a char, it still ends up needing to be a Rune at some point or else there's unnecessary casting that the compiler unfortunately actually does emit, even in optimized builds.
There are multiple unicode-aware ways to do it now, including the old method of always going through it with the StringRuneEnumerator, and each one has different considerations/consequences for performance and code complexity. Since most should be fairly short strings, it's probably best to approach it with that assumption, optimizing for the most common case.
While I'd like to make other changes that would have even bigger impact, such as retaining the values and only performing the search again if the text is altered, I'm holding off on that kind of thing, for now, as those things will involve other code in more places and will need more extensive testing, just in case there are unforeseen side effects of that kind of thing.
Whoops. Finger tapped and clicked the button as I was scrolling. 🤦♂️
I also have a major love/hate relationship with Rune (more hate than love).
Why Microsoft thought it would be cool to add in a UTF-32 (which nobody really uses) type that - oh yeah - is only aware of scalars (which misses the point of huge parts of what unicode offers), among plenty of other issues, to deal with text, when the rest of the world was already on the UTF-8 train and dotnet already had and understood the UTF-8 encoding is just.... 🤯
Do we want that or can it be removed?
Nuke from orbit please.
Already done. :)
And I have a new implementation for this method, now, which:
- Is zero-allocation for all inputs.
- Not even an enumerator allocation.
- Doesn't create any additional Runes while it runs.
- Fixes existing unit test cases that were marked as bugged, so those test cases now look for the proper value (and pass with it).
- Has the new desired behavior of letting the specifier key be a legal value for the hotkey, if it finds two of them consecutively.
My code isn't even special - it just adapts for what does the real unicode-related magic, which is a specific overload of a static StringInfo method that is intended for exactly this sort of use case/pattern, but which is non-obvious and not particularly well-documented, IMO. I actually came across its existence and applicability by chance from a performance-related question issue filed on the dotnet runtime repo that I almost didn't even look at.
Now to expand the test cases for more positive proof for that unit and this will be ready to go.
Remember this branch contains other mostly cleanup and performance-related stuff, as well, that was part of some rabbit holes I dove down in the process.
I rebased it on v2_develop before I got back to work on it, today, so it's all up to date and conflict-free, with one (already resolved) exception. There was a dead/unused method I had removed that got modified in another PR, and I just resolved that by letting it use the code from v2_develop and didn't bother removing it again, since it was just opportunistic and isn't in scope.
Performance-wise, avoiding heap allocations is major, in any code, when possible.
Crude benchmarks showed this new implementation to be slightly better than on-par with an already-improved version I had written, in the worst case, but usually a LOT faster than that implementation, and faster in all cases vs the original with the capital search removed.