stdlib
stdlib copied to clipboard
feat: add fuzzy matching algorithm for tab completions
Resolves #1845
Description
What is the purpose of this pull request?
This pull request:
- adds fuzzy matching algorithm for tab completions
- write a custom completer engine with navigation and highlighted completions
The algorithm revolves around scoring each completion based on the number of matching characters, and the distance between the matching characters in the completion & input. It easily catches minor spelling mistakes and matches 'similar looking' prefixes (at least from how much I have tested).
Complexity (assuming startsWith
takes linear time):
Before: O(n) per completion, where n = length of input string. After: O(m+n) per completion, where n = length of input string & m = length of completion
Update: The algorithm is now updated.
Related Issues
Does this pull request have any related issues?
This pull request:
- resolves #1845
- relates to https://github.com/stdlib-js/google-summer-of-code/issues/1
Questions
Any questions for reviewers of this pull request?
right now fuzzy match will only work for completing expressions and not for incomplete workspaces, file system, tutorial and require statements, as for some reason they are not using the abstracted filterByPrefix
method. Instead they are reimplementing it (see image).
Would do the refactoring once the algorithm is finalized, to fix this!
Update: fixed!
Other
Any other information relevant to this pull request? This may include screenshots, references, and/or implementation notes.
I tried other popular fuzzy auto-completion algorithms too (not many) but completions didn't feel as relevant as I expected from a code completer. The Levenshtein algorithm, for example, favors suggestions based on how 'similar looking' two strings are, meaning two strings with a higher difference in length would be scored worse (as it requires more insertions), which is not ideal for code completion.
What I haven't done much yet, is explore existing 'code completions' algorithms in IDEs/Editors and see how they are implementing them. Please tell me if I should do that.
The current implementation does work really well though based on the limited cases I've tested it through.
Update: Updated the algorithm!
And AFAIK, node.js and ipython don't have fuzzy auto-completion yet. (correct me if I'm wrong)
Checklist
Please ensure the following tasks are completed before submitting this pull request.
- [x] Read, understood, and followed the contributing guidelines.
@stdlib-js/reviewers
@kgryte Regarding visually highlighting the matching characters in the completions (ys -> yes).
I am a bit confused about how we can achieve this as we are relying on built-in readline module's completer for tab completions. I don't think we can alter on how the completer displays the completions.
One of the (rather hacky) ways I tried doing this is by including the bold outputs in the resulting array itself that is returned by the completer callback.
This didn't go well with readline completer's feature where say I type common
and the possible completions are commonKeys
and commonKeysIn
, If I click TAB, instead of showing both the completions, it auto-completes it to commonKeys
as its the longest common substring of all completions.
So in our case, if the first letter (say c) of input is common in all completions, {ascii-code}c{ascii-code}
(our bold output) becomes the longest common substring of all completions causing completer to just auto-complete it instead of displaying all possible completions.
Is there any other way I can go around implementing this??
And regarding the algorithm, afaik this isn't based on any prior implementation/art. It's a simple algorithm that I wrote from scratch that matches based on how much the characters in the completion deviate from the input string (distance).
The magic 0.25
is a weight I have iteratively reached at, and 0.7
is the threshold score above which matches are permissible.
Hmm...yeah, I hadn't thought about the completer controlling the terminal output. My initial thought was escape sequences, but then you are right, I can see how that might be an issue with auto-completion.
Well...one possibility is to just to write our own completer. We don't have to use the built-in. Could write it from scratch. May be worth exploring the Node.js source implementation to see how it handles the completer logic.
For IPython, looks like they use jedi for auto-completion. In Node.js, fuzzy-matching is not supported.
For IPython, looks like they use jedi for auto-completion. In Node.js, fuzzy-matching is not supported.
Fuzzy matching is not supported in IPython either from what I know. Am I mistaken?
For IPython, looks like you can enable jedi: https://ipython.readthedocs.io/en/stable/api/generated/IPython.core.completer.html#IPython.core.completer.Completer.use_jedi. In which case, I believe you can get fuzzy auto-complete in IPython: https://jedi.readthedocs.io/en/latest/docs/api.html?highlight=fuzzy#jedi.Script.complete.
I haven't personally used jedi in IPython, so YMMV.
@kgryte understood! I'll explore writing our own completer, and studying jedi's fuzzy matching algorithm👍
Don't stray too far down the rabbit hole! It may be vast. 😅
right now fuzzy match will only work for completing expressions and not for incomplete workspaces, file system, tutorial and require statements, as for some reason they are not using the abstracted filterByPrefix method.
This was intentional, as matching is "inlined" due to additional loop logic (e.g., complete_require.js
). It should, however, be possible to swap out startsWith
with fuzzyMatch
, with the latter being a separate module, rather than a private function.
We'll want fuzzy matching to be enabled via a setting. This is another thing which users may have preferences over.
And as a first pass, we can merge basic functionality in first and then do character highlighting in a follow-up PR, as that is likely to require yet more refactoring and is not strictly necessary in order to get the benefits of fuzzy matching.
In the JSDoc comment of fuzzyMatch
, it would be good to explain the algorithm/approach a bit and include example values for each stage of the algorithm. It can be a bit hard to grok what you are doing just by looking at for
loops.
@kgryte I am almost done with the character highlighting part though, can I push the changes and we can iterate over it after?
@Snehil-Shah Sure. In general, going forward, it would be good to split tasks which can be split into separate PRs, as any additional complexity will increase the risk of PRs getting bogged down in review and refactoring.
@kgryte a progress report:
-
Re-wrote the node.js readline completer: One of the small things I changed is, a single TAB will display the completions (or auto-complete if a common-prefix exists) instead of clicking TAB twice like it was previously. I feel like it's more intuitive. If there are reasons against it, I will change it..
-
Re-wrote the algorithm after studying some of the references you gave, and tried to document it well. It's still a private method, not a module as I was mostly done with the algorithm today itself. Will move to refactoring it to a module. The current algorithm borrows the penalty system from the code mirrors's fuzzy matching algorithm. The algorithm also catches spelling mistakes (borrowing from my previous draft algorithm). The penalty fines and the
LENIENCY FACTOR
are values I have iteratively reached at while playing around, it will need more fine-tuning but it works good from my experience. -
Did some changes to sort the completions by relevancy (scores) and if two completions have the same scores, I sort them by length instead of lexicographic. I feel shorter completions feel more relevant if two completions have the same score.
-
Support for highlighted completions:
-
One of the problems that cames up was the preview completer started bugging as it was expecting the longest common prefix from the completions to
startWith
the input. That isn't the case with fuzzy completions.Added a small patch for that too. seems to work fine..
-
This branch is currently outdated and doesn't have the settings command and API, will merge the develop branch and work on controlling fuzzy match via a setting
-
About moving fuzzyMatch to a module, do you mean making a string package (
@stdlib/string/fuzzyMatch
)? Right now fuzzy matching is only implemented for expressions. Will replace thestartWith
withfuzzyMatch
after we are done with the module
- Module vs package. A module is single a file which has exports. A package is something that is independently installable. In this case, I was suggesting that you move
fuzzyMatch
to a separate file (i.e., a module).
- Highlighted completions. IMO, the unordered-ness of the completion results is not good UX. It makes it rather difficult to find what you want, as your eyes have to scan each entry. With lexicographic, visual scanning is straightforward and predictable, allowing for faster manual lookup. I am not sure the ideal solution here.
Actually, now that I am thinking about it, how useful is it to display a list of fuzzy TAB completions?
Let's take your cons
example above. If I type cons
and see iterConstant
, I need to hit backspace to the beginning of the line and start over, typing iter
followed by Cons
. In short, without a mechanism for cycling among fuzzy completion options as displayed (e.g., via arrow keys as one might in a terminal menu), I am not sure how useful fuzzy TAB completions are.
Okay, so where does that leave us? Well, I think there is a use case for fuzzy completions and that is when I've entered some text and it doesn't correspond to any context variable which we can TAB complete. E.g., I've misspelled cons
, maybe be entering cns
, forgetting the o
. When I hit TAB expecting completions without realizing that I've made a typo, that would be a good time to show fuzzy completions.
However, by default, I think it makes sense to use the prior completion logic when there are exact prefix matches or an exact match.
In short, fuzzy completions are likely most helpful when someone does something wrong, but not when they've done something right.
In short, without a mechanism for cycling among fuzzy completion options as displayed (e.g., via arrow keys as one might in a terminal menu), I am not sure how useful fuzzy TAB completions are.
@kgryte I will work on the changes.
But, now that I am thinking about it, as we already wrote our own completer, we can surely have this feature as well where the user can cycle through the completions!
I tried playing around and made a quick prototype. Attaching below
https://github.com/stdlib-js/stdlib/assets/130062020/2d1168bf-cc54-4521-a90a-c8e23b1a6ae3
Unlike the previous completer, where completions were simply printed and we moved to the next prompt, now we stay at the same place and can see the completions below, we can further cycle through using arrow keys!
I think this is a really cool addition. Should I push this change?
@Snehil-Shah The ability to arrow through fuzzy match options is useful. A few comments:
- We should avoid changing the line text, as otherwise a user cannot get back to what they have written. Instead, once a user hits the down arrow and enters the menu, the only thing that should change is what is highlighted.
- I'm still of the opinion that when there are exact prefix matches, you don't want fuzzy tab completion. E.g., I often use TAB completion to remind me the destination of what I want to type next, not in order to perform a quasi-search. For explicit search, I think it makes sense to have a dedicated API.
@Snehil-Shah The ability to arrow through fuzzy match options is useful. A few comments:
- We should avoid changing the line text, as otherwise a user cannot get back to what they have written. Instead, once a user hits the down arrow and enters the menu, the only thing that should change is what is highlighted.
- I'm still of the opinion that when there are exact prefix matches, you don't want fuzzy tab completion. E.g., I often use TAB completion to remind me the destination of what I want to type next, not in order to perform a quasi-search. For search explicitly, I think it makes sense to have a dedicated API.
@kgryte with the current implementation, the user can get back to the previous line by instinctively clicking the up arrow beyond the first row. I feel this is ergonomic as the user can quickly navigate and then continue typing after the completion is inserted, and if they don't want the completion they can go back up to the line
A dedicated search API is definitely useful, but I feel the user still would want fuzzy completions being displayed if they don't exactly remember what they are searching for. Example: for a local variable name, it's unlikely the user would want to use a search API just to search a variable they defined above. If it looks messy, maybe we can add a space to separate exact matches and fuzzy matches? What do you think? Also with current implementation, the exact prefixes are shown first and in lexicographic order, and then the fuzzy completions are shown in relevant order (score).
Feel free to try separating by a line and we can discuss whether it improves UX.
I recommend going ahead and reconciling the merge conflicts; otherwise, you're likely to make your life more painful later on.
Fuzzy completions raise a question as to where do you stop with regard to namespaces? E.g., if I type abs
, we could reasonably return abs
, iterAbs
, etc from the current REPL context, but we could also search namespaces (e.g., base.abs
). On the one hand, maybe I type abs
and I want to know where all the "abs" functionality is. But that would increase complexity AND it begs where to stop. E.g., if I type toStr
, do we then walk context variables to check for toString
properties? I'm not sure.
Still not clear to me that TAB completion is actually the best use case for fuzzy match.
@kgryte searching namespaces is not ideal for tab completions, because then it's trying too hard to be a search than a completer.
Taking a step back, I think the purpose of a tab completer should be guessing what the user is trying to type, if the user has already typed base.a
, then yes we should search that namespace like we already do.
And as the user can make mistakes in typing or if the user doesn't exactly remember what to type, fuzzy completions play a good role there by guessing what they're trying to type. If they typed base.abd
, the completer should know what they're trying to say.
@kgryte Done with the implementation.
Fuzzy completions is now controllable via settings, and also added a max height to the completions output, as, in cases where the output was taller than the terminal (causing scrolling), moveCursor
wasn't working as expected.
Current behaviours of TAB completions:
- going up to the line, brings back the original line and going to any completion automatically updates the line.
https://github.com/stdlib-js/stdlib/assets/130062020/83163553-414b-4542-bacc-4905f435c063
-
Hitting TAB is like a toggle button to bring up and hide the completions panel.
-
I tried separating the exact completions from the fuzzy ones and IMO they look worse:
It was better before, and now that we have a setting to disable it, in my opinion, we can normally display them just after the exact completions as it doesn't hurt to have more completions that the user might be potentially going for, given the exact ones are up top already. And if the user doesn't want it, they can disable it via settings. But if you have a differing opinion, we can stick to your recommendation of using exact prefixes by default and when nothing matches, showing fuzzy completions.👍
Re: line separation. I don't think I agree here. By mixing the exact match and fuzzy results, you are making the user search for the exact matches. They, themselves, have to identify the boundaries between fuzzy and exact match. And that is not ideal UX, especially as the fuzzy matches create visual noise.
Take your example above. For the exact matches, all matching text is lower case, but, for the fuzzy matches, you have a mix of lowercase, camelcase, and title case. The fuzzy completions are just inherently noisier, as there is no pattern. And because of that noise, you are imposing additional cognitive load on users to identify the "correct" match and to find how their query matches any one specific fuzzy match. In short, you are making users work. And users, generally, try to avoid work.
The line separation makes it clear what is exact and what is fuzzy. And a line separation provides structure to results for which, displayed together, lack obvious structure, until you spend time understanding order and implicit grouping. IMO, grouping together is worse UX.
On another note, I am wondering if we should limit the number of results in order to avoid choice paralysis. Did a quick search (ref 1 and ref 2), and seems generally agreed to limit fuzzy results to no more than 10 items. Anything more and users spend more time searching than auto-completing.
@kgryte I understand. Regarding limiting the results, are we talking about just limiting the fuzzy results or all completions in general? Because if we limit even exact prefixes, then we are losing potential completions that the user is actually trying to find as the exact prefixes are sorted lexicographically and not based on some order of relevancy. in short, on what basis will we be discarding exact prefixes by limiting overall completions to 10? Limiting fuzzy results does make sense though as we have a basis for discarding completions (ie. score). And what are your thoughts on showing fuzzy matches only when there are no exact prefixes? It would lead to lesser and cleaner completions and better perf..
Agreed. Restricting number of completion candidates only makes sense for fuzzy completions.
And yeah, I still lean toward fuzzy completions only when no exact prefix matches.
Also, fuzzy completions only when no exact match is the less invasive choice to start. If users start clamoring for fuzzy completions even when there are exact prefix matches, should be straightforward to refactor to match the current behavior. And we have this PR as a reference.
@kgryte great. Will go ahead with fuzzy completions only when there's no exact matches. And now that we are doing this, does it make sense to limit results? When we are displaying fuzzy completions, it means there were no exact matches. Presumably, that means the fuzzy completions too wouldn't be that much. Plus as there are no exact completions, there's a chance the user really needs help there, and limiting fuzzy completions would be counter intuitive imo. What do you think?
Maybe make the number of displayed fuzzy completions a setting?
@kgryte we can do that. although I wonder if it's that user-friendly given it's generally abstracted to just a true or false in most editors🤔
Another option would be to provide a setting (or settings) for users to adjust the magic parameters which are used to score results. That would achieve the same end as limiting the number of results. But, then again, we could change the underlying scoring algo in the future.
Regardless, I am not sure that users benefit all that much from having more than 10 fuzzy choices. One option to make it more "powerful" or relevant is to weight items in a user's history higher. But this makes the implementation more complex.
I am tempted to just say limit to 10. If people want more, we can easily make this adjustable.