godot Add fuzzy string matching to quick open search

This is a continuation of https://github.com/godotengine/godot/pull/82200 to implement https://github.com/godotengine/godot-proposals/issues/7771. Rebasing the first PR against https://github.com/godotengine/godot/pull/56772 was pretty awkward so I ended up flattening the earlier changes with @samsface as a co-author. A bit more info and pre-flattened commits are available in this pr against sam's fork https://github.com/samsface/godot/pull/1.

Functionally this adds optional fuzzy searching and search highlighting in the results:

The highlighting on the grid items feels a bit hacky due to the centered text; I'd appreciate any tips to do it more cleanly. This is also my first time working on a nontrivial c++ codebase so I'd appreciate any general tips to be more idiomatic or if I'm using anything incorrectly. Also a few semi-related changes:

Fixes a bug where, if adaptive isn't used, on first launch neither result container will be visible. Now it uses project metadata to track the last mode, defaulting to list.
Grid items are slightly wider, showing 5 per row with the default window size, in order to show more of the filename.
In AcceptDialog, the vertical button margin isn't considered if the button hbox has min y size of 0 in order to reduce the "chin" on the quick open popup.

Oct 17 '24 20:10 a-johnston

@a-johnston How does this new algorithm work? There seems to be a lot more to it than the first two approaches. But it works pretty well. Opened a small PR to fix some C++ gotchas but whole thing looks good to me (though didn't look at any of the Dialog code).

Also, is there a way to not match "res://" ?

Is it correct that we rank HAT.wav first here?

Oct 18 '24 09:10 samsface

@samsface thanks for testing it out! The matching process is similar to the earlier ones, although there are definitely more helpers etc cluttering it up haha. It essentially is doing:

Match each token starting from 0
If a match is found, test to see if that match overlaps any previously matched sections and reject it if so
Otherwise, score the new match and potentially update the top scoring match
In either case of a match, move the offset from 0 to start + 1 and try again. If no match, quit

Scoring then prioritizes (in rough order of importance):

Exact token matches without breaks
Longer matched sections
Fewer missed query characters
Filename match
Word boundary match

Is it correct that we rank HAT.wav first here?

Results with equal scores are tie-broken first on length and then on their alphanumeric order, which is why HAT.wav is ordered before hat.wav -- I'll add a slight score deduction if the match relies on being case insensitive to break that in favor of the exact match.

Also, is there a way to not match "res://" ?

I hadn't even noticed that! Yeah I can add a way to the fuzzy search code to skip a given prefix, or we can change that part of the updated quick open popup to not include it to begin with.

Also thanks for mentioning the PR; I didn't get an email or notification about it so I would've missed it otherwise. TIL!

Oct 18 '24 17:10 a-johnston

I just updated it to 1) add a slight tie-breaker penalty if it relies on case insensitive matching and 2) ignore res:// by setting a new starting offset value. Ironically, similar to last time, I'll be out of town this weekend so slow updates from me for a bit.

Oct 18 '24 18:10 a-johnston

I wonder if FuzzySearch can be used for autocompletion 🤔 We used to have something similar, but it was reverted. Although for that, it can't be coupled to QuickOpenDialog.

Oct 18 '24 20:10 KoBeWi

I wonder if FuzzySearch can be used for autocompletion 🤔 We used to have something similar, but it was reverted. Although for that, it can't be coupled to QuickOpenDialog.

Thanks for the thorough pass! Someone had linked the prior merged and reverted subsequence matching stuff; I guess it picked up way too many low-quality results. This could be used in other places - it is definitely intentional that the fuzzy search class is distinct from the quick open stuff, and the use of RefCounted was I think originally so that it could be exposed in the gdscript api in a later change. It is somewhat coupled to the use of filepaths since it considers the last / index, but that just wouldn't be a factor for scoring non-path items.

Oct 18 '24 22:10 a-johnston

I tested the changes in my project with lots of files and the performance is fine.

The results are more or less as accurate, however for some weird reason, tokens are favored in reverse order sometimes: (left is old search, right is new)

Also the way matches are highlighted is a bit confusing: The path is positioned before the name, but in the dialog it appears below and gray. Makes the order unintuitive. I know the dialog was already like that and not sure how it can be improved. Technically it's correct, just pointing it out.

The two unlabeled buttons at the bottom look odd: It was acceptable when there was one, but now... idk.

There is also a regression: when search string is empty, the dialog displays nothing. Before this PR it would display recently opened entries.

Aside from the regression and token order weirdness, it's an improvement overall.

The code is probably ok. The FuzzySearch code is quite complex, so I mostly checked the style. As I noted in one of the comments, the fuzzy search classes don't need to be RefCounted. You can make them plain C++ classes, so they can be managed on stack instead of heap. It's tricky to use though (you'll need to pass by reference everywhere to avoid copying). If you plan to eventually try exposing it, you can leave FuzzySearchResult as RefCounted, as it seems to be the only class used non-internally.

I didn't check the added tests.

Oct 18 '24 22:10 KoBeWi

Thanks again for all the feedback. I don't think I'll get through this in the next few hours so I'll try to have an update ready next week.

tokens are favored in reverse order sometimes:

I think in this case all of the results in that right screenshot have an equal score. I'll add another tie breaking rule to favor in-order matches.

It was acceptable when there was one, but now... idk.

I agree; I'll add short labels for each one in addition to the current tooltips.

There is also a regression: when search string is empty, the dialog displays nothing. Before this PR it would display recently opened entries.

I'll need to check but I don't think I touched that logic. Iirc it only kicks in when you're searching for a single type, like with an export, and not when searching generally across more files/types https://github.com/godotengine/godot/blob/master/editor/gui/editor_quick_open_dialog.cpp#L662-L663 but I'll test and fix it once I'm through the other comments.

Oct 18 '24 23:10 a-johnston

As I noted in one of the comments, the fuzzy search classes don't need to be RefCounted

Yeah I think speed maters for search so would second this. Maybe best to do everything without Refs and then have a separate pr to expose this in gdsctipt that copied the results into a friendlier class for gdscript use.

Oct 19 '24 11:10 samsface

Sounds good, I'll convert it to not use RefCounted. Out of curiosity, I'll benchmark it before and after, and maybe add a few more benchmark cases. It makes sense to consider a better gdscript api; would that be its own proposal? I remember seeing a comment asking for this to be exposed to gdscript, but I'm assuming people would want more discussion about how exactly it should be exposed (like maybe as a string method search/search_all?)

Oct 22 '24 05:10 a-johnston

Sounds good, I'll convert it to not use RefCounted. Out of curiosity, I'll benchmark it before and after, and maybe add a few more benchmark cases. It makes sense to consider a better gdscript api; would that be its own proposal? I remember seeing a comment asking for this to be exposed to gdscript, but I'm assuming people would want more discussion about how exactly it should be exposed (like maybe as a string method search/search_all?)

Yeah a separate proposal makes total sense to me. Once the gdscript api goes in, we can’t break it without breaking everyone’s code using it. So best not delay this PR with a ton more discussion.

Oct 22 '24 08:10 samsface

Alright updated the branch with the following changes: (screenshots in top post are also updated)

New classes are not RefCounted anymore and copies are generally minimized
Results vector dynamically resizes so the setting no longer requires restart on change. Also this allows the destructor to be removed
A small tie-breaking bonus is added when tokens are matched in order
Regression with showing history is fixed
Labels have been added to each toggle, and the default window width has increased to account for this
A few new test cases have been added to try and stress different points of the search

Overall, although the new tie breaker does incur a small cost, this version of the code is ~30% faster according to the benchmark, with the speedup slightly increasing as the number of files increases. Queries with many tokens or token matches really benefit with close to a ~50% speedup.

Oct 24 '24 07:10 a-johnston

So uh The path should be above buttons. There is not enough space now. Also instead of separate labels, you can set text of the CheckButtons, so they are easier clickable.

Oct 24 '24 15:10 KoBeWi

Will this apply to the Command Palette too? I often find myself mis-spelling things or using alternative words in tools like VS Code and Rider, and still getting to the same result - but I have to type everything exactly in Godot. For instance, typing "options" for opening settings, and being allowed to spell it wrong

Oct 24 '24 17:10 btarg

@btarg not as part of this PR (it's already too big) but the fuzzy searching logic was intentionally kept separate so that it could be used elsewhere, just as a part of a different PR.

Oct 24 '24 17:10 a-johnston

There's one more optimization I'd like to apply (but in another PR) that in my testing cut most of the benchmarks by 50%. A cull during the search that discards results early if their score is considerably less than the current max. I noticed we spend a ton of time adding then removing results that had like, a score of 3 while the max sore is 100.

Ah yeah I was thinking about something similar but instead considering the future potential score a la an advent of code problem from last(?) year. If you have the branch ready, feel free to pr against https://github.com/a-johnston/godot/tree/fuzzy-search-rebase

Oct 25 '24 16:10 a-johnston

There's one more optimization I'd like to apply (but in another PR) that in my testing cut most of the benchmarks by 50%. A cull during the search that discards results early if their score is considerably less than the current max. I noticed we spend a ton of time adding then removing results that had like, a score of 3 while the max sore is 100.

Ah yeah I was thinking about something similar but instead considering the future potential score a la an advent of code problem from last(?) year. If you have the branch ready, feel free to pr against https://github.com/a-johnston/godot/tree/fuzzy-search-rebase

It's already fast enough (imo) so wouldn't want to delay the PR. Let me know if you need help with any coding work to get the PR merged.

Oct 25 '24 18:10 samsface

It's already fast enough (imo) so wouldn't want to delay the PR. Let me know if you need help with any coding work to get the PR merged.

Haha well you piqued my curiosity so I wanted to try it before nuking the benchmark. I settled on a pretty tame early cull criteria of being 1) negative (so, bad and missing query characters) and 2) at least 50 away from the current max score because I noticed that although the top results were of course preserved, the combination of match length being squared and a 100 bonus being given to exact token matches resulted in a query of ie spawner having a huge delta between spawn.tscn and spawner.tscn although to me the former still seems very relevant as a secondary result. From logging the early/late cull counts, this seems to have the most pronounced effect early when typing the query, when many more results are "valid" but negative, but later in the query most targets filter out with invalid matches. In the benchmark with this early cull criteria I saw a ~5% speedup for a synthetic 100k case and ~3% speedup for a 1k case, although in terms of the delta and not ratio, the 100k case had an average delta of ~3ms and the 1k case never had a delta greater than a few percent of a ms. That said, the benchmark is pretty biased due to the way I pruned down your tree to 1k lines.

Although the changes are pretty small I won't include them here but yeah for a future pr that addresses this

Keep in mind the nonlinear score bonuses and secondary results, not just the top result matching the benchmark's expected result
I've been changing the list item's name to name->set_text(p_candidate.file_path.get_file() + vformat(" (%d)", p_candidate.result->score)); to get a better intuition for the relative scores of displayed results

Oct 25 '24 19:10 a-johnston

Should I rebase these changes down to fewer total commits or is it fine as is?

Oct 28 '24 17:10 a-johnston

Yes please squash your commits into one, with a clear commit message with the same title as the PR

Oct 28 '24 18:10 AThousandShips

Not planning on making any more changes unless things are requested here but just to add visibility

There is an alternative RichTextLabel impl here https://github.com/godotengine/godot/pull/98278#discussion_r1818243991 which avoids the Label changes
It would be good (and pretty easy) to add additional editor options for what colors to use in the highlight border/fill. This would especially be useful for a11y / custom theme stuff. If this wasn't already reviewed I probably would add it in here.

Oct 29 '24 01:10 a-johnston

Thanks! Congratulations on your first contribution! 🎉

Nov 10 '24 18:11 Repiteo

Thank you!

every single change we've done so far brought its wave of discontents. Let's hope this time's the One Algorithm that Satisfies Them All.

Hopefully having added some relevant editor options helps with that :P

Nov 10 '24 20:11 a-johnston