stdlib icon indicating copy to clipboard operation
stdlib copied to clipboard

feat: add fuzzy matching algorithm for tab completions

Open Snehil-Shah opened this issue 11 months ago • 36 comments

Resolves #1845

Description

What is the purpose of this pull request?

This pull request:

  • adds fuzzy matching algorithm for tab completions
  • write a custom completer engine with navigation and highlighted completions

The algorithm revolves around scoring each completion based on the number of matching characters, and the distance between the matching characters in the completion & input. It easily catches minor spelling mistakes and matches 'similar looking' prefixes (at least from how much I have tested).

Complexity (assuming startsWith takes linear time):

Before: O(n) per completion, where n = length of input string. After: O(m+n) per completion, where n = length of input string & m = length of completion

Update: The algorithm is now updated.

Related Issues

Does this pull request have any related issues?

This pull request:

  • resolves #1845
  • relates to https://github.com/stdlib-js/google-summer-of-code/issues/1

Questions

Any questions for reviewers of this pull request?

right now fuzzy match will only work for completing expressions and not for incomplete workspaces, file system, tutorial and require statements, as for some reason they are not using the abstracted filterByPrefix method. Instead they are reimplementing it (see image).

complete_workspace.js

Would do the refactoring once the algorithm is finalized, to fix this!

Update: fixed!

Other

Any other information relevant to this pull request? This may include screenshots, references, and/or implementation notes.

I tried other popular fuzzy auto-completion algorithms too (not many) but completions didn't feel as relevant as I expected from a code completer. The Levenshtein algorithm, for example, favors suggestions based on how 'similar looking' two strings are, meaning two strings with a higher difference in length would be scored worse (as it requires more insertions), which is not ideal for code completion.

What I haven't done much yet, is explore existing 'code completions' algorithms in IDEs/Editors and see how they are implementing them. Please tell me if I should do that.

The current implementation does work really well though based on the limited cases I've tested it through.

Update: Updated the algorithm!

And AFAIK, node.js and ipython don't have fuzzy auto-completion yet. (correct me if I'm wrong)

Checklist

Please ensure the following tasks are completed before submitting this pull request.


@stdlib-js/reviewers

Snehil-Shah avatar Mar 12 '24 23:03 Snehil-Shah

@kgryte Regarding visually highlighting the matching characters in the completions (ys -> yes).

I am a bit confused about how we can achieve this as we are relying on built-in readline module's completer for tab completions. I don't think we can alter on how the completer displays the completions.

One of the (rather hacky) ways I tried doing this is by including the bold outputs in the resulting array itself that is returned by the completer callback. This didn't go well with readline completer's feature where say I type common and the possible completions are commonKeys and commonKeysIn, If I click TAB, instead of showing both the completions, it auto-completes it to commonKeys as its the longest common substring of all completions. So in our case, if the first letter (say c) of input is common in all completions, {ascii-code}c{ascii-code} (our bold output) becomes the longest common substring of all completions causing completer to just auto-complete it instead of displaying all possible completions.

Is there any other way I can go around implementing this??

And regarding the algorithm, afaik this isn't based on any prior implementation/art. It's a simple algorithm that I wrote from scratch that matches based on how much the characters in the completion deviate from the input string (distance). The magic 0.25 is a weight I have iteratively reached at, and 0.7 is the threshold score above which matches are permissible.

Snehil-Shah avatar Mar 24 '24 09:03 Snehil-Shah

Hmm...yeah, I hadn't thought about the completer controlling the terminal output. My initial thought was escape sequences, but then you are right, I can see how that might be an issue with auto-completion.

Well...one possibility is to just to write our own completer. We don't have to use the built-in. Could write it from scratch. May be worth exploring the Node.js source implementation to see how it handles the completer logic.

kgryte avatar Mar 24 '24 09:03 kgryte

For IPython, looks like they use jedi for auto-completion. In Node.js, fuzzy-matching is not supported.

kgryte avatar Mar 24 '24 09:03 kgryte

For IPython, looks like they use jedi for auto-completion. In Node.js, fuzzy-matching is not supported.

Fuzzy matching is not supported in IPython either from what I know. Am I mistaken?

Snehil-Shah avatar Mar 24 '24 10:03 Snehil-Shah

For IPython, looks like you can enable jedi: https://ipython.readthedocs.io/en/stable/api/generated/IPython.core.completer.html#IPython.core.completer.Completer.use_jedi. In which case, I believe you can get fuzzy auto-complete in IPython: https://jedi.readthedocs.io/en/latest/docs/api.html?highlight=fuzzy#jedi.Script.complete.

kgryte avatar Mar 24 '24 10:03 kgryte

I haven't personally used jedi in IPython, so YMMV.

kgryte avatar Mar 24 '24 10:03 kgryte

@kgryte understood! I'll explore writing our own completer, and studying jedi's fuzzy matching algorithm👍

Snehil-Shah avatar Mar 24 '24 10:03 Snehil-Shah

Don't stray too far down the rabbit hole! It may be vast. 😅

kgryte avatar Mar 24 '24 10:03 kgryte

right now fuzzy match will only work for completing expressions and not for incomplete workspaces, file system, tutorial and require statements, as for some reason they are not using the abstracted filterByPrefix method.

This was intentional, as matching is "inlined" due to additional loop logic (e.g., complete_require.js). It should, however, be possible to swap out startsWith with fuzzyMatch, with the latter being a separate module, rather than a private function.

We'll want fuzzy matching to be enabled via a setting. This is another thing which users may have preferences over.

And as a first pass, we can merge basic functionality in first and then do character highlighting in a follow-up PR, as that is likely to require yet more refactoring and is not strictly necessary in order to get the benefits of fuzzy matching.

kgryte avatar Apr 02 '24 06:04 kgryte

In the JSDoc comment of fuzzyMatch, it would be good to explain the algorithm/approach a bit and include example values for each stage of the algorithm. It can be a bit hard to grok what you are doing just by looking at for loops.

kgryte avatar Apr 02 '24 06:04 kgryte

@kgryte I am almost done with the character highlighting part though, can I push the changes and we can iterate over it after?

Snehil-Shah avatar Apr 02 '24 08:04 Snehil-Shah

@Snehil-Shah Sure. In general, going forward, it would be good to split tasks which can be split into separate PRs, as any additional complexity will increase the risk of PRs getting bogged down in review and refactoring.

kgryte avatar Apr 02 '24 08:04 kgryte

@kgryte a progress report:

  1. Re-wrote the node.js readline completer: One of the small things I changed is, a single TAB will display the completions (or auto-complete if a common-prefix exists) instead of clicking TAB twice like it was previously. I feel like it's more intuitive. If there are reasons against it, I will change it..

  2. Re-wrote the algorithm after studying some of the references you gave, and tried to document it well. It's still a private method, not a module as I was mostly done with the algorithm today itself. Will move to refactoring it to a module. The current algorithm borrows the penalty system from the code mirrors's fuzzy matching algorithm. The algorithm also catches spelling mistakes (borrowing from my previous draft algorithm). The penalty fines and the LENIENCY FACTOR are values I have iteratively reached at while playing around, it will need more fine-tuning but it works good from my experience.

  3. Did some changes to sort the completions by relevancy (scores) and if two completions have the same scores, I sort them by length instead of lexicographic. I feel shorter completions feel more relevant if two completions have the same score.

  4. Support for highlighted completions:

image

  1. One of the problems that cames up was the preview completer started bugging as it was expecting the longest common prefix from the completions to startWith the input. That isn't the case with fuzzy completions.

    error_preview Added a small patch for that too. seems to work fine..

  2. This branch is currently outdated and doesn't have the settings command and API, will merge the develop branch and work on controlling fuzzy match via a setting

  3. About moving fuzzyMatch to a module, do you mean making a string package (@stdlib/string/fuzzyMatch)? Right now fuzzy matching is only implemented for expressions. Will replace the startWith with fuzzyMatch after we are done with the module

Snehil-Shah avatar Apr 02 '24 13:04 Snehil-Shah

  1. Module vs package. A module is single a file which has exports. A package is something that is independently installable. In this case, I was suggesting that you move fuzzyMatch to a separate file (i.e., a module).

kgryte avatar Apr 03 '24 03:04 kgryte

  1. Highlighted completions. IMO, the unordered-ness of the completion results is not good UX. It makes it rather difficult to find what you want, as your eyes have to scan each entry. With lexicographic, visual scanning is straightforward and predictable, allowing for faster manual lookup. I am not sure the ideal solution here.

kgryte avatar Apr 03 '24 03:04 kgryte

Actually, now that I am thinking about it, how useful is it to display a list of fuzzy TAB completions?

Let's take your cons example above. If I type cons and see iterConstant, I need to hit backspace to the beginning of the line and start over, typing iter followed by Cons. In short, without a mechanism for cycling among fuzzy completion options as displayed (e.g., via arrow keys as one might in a terminal menu), I am not sure how useful fuzzy TAB completions are.

Okay, so where does that leave us? Well, I think there is a use case for fuzzy completions and that is when I've entered some text and it doesn't correspond to any context variable which we can TAB complete. E.g., I've misspelled cons, maybe be entering cns, forgetting the o. When I hit TAB expecting completions without realizing that I've made a typo, that would be a good time to show fuzzy completions.

However, by default, I think it makes sense to use the prior completion logic when there are exact prefix matches or an exact match.

In short, fuzzy completions are likely most helpful when someone does something wrong, but not when they've done something right.

kgryte avatar Apr 03 '24 03:04 kgryte

In short, without a mechanism for cycling among fuzzy completion options as displayed (e.g., via arrow keys as one might in a terminal menu), I am not sure how useful fuzzy TAB completions are.

@kgryte I will work on the changes.

But, now that I am thinking about it, as we already wrote our own completer, we can surely have this feature as well where the user can cycle through the completions!

I tried playing around and made a quick prototype. Attaching below

https://github.com/stdlib-js/stdlib/assets/130062020/2d1168bf-cc54-4521-a90a-c8e23b1a6ae3

Unlike the previous completer, where completions were simply printed and we moved to the next prompt, now we stay at the same place and can see the completions below, we can further cycle through using arrow keys!

I think this is a really cool addition. Should I push this change?

Snehil-Shah avatar Apr 03 '24 23:04 Snehil-Shah

@Snehil-Shah The ability to arrow through fuzzy match options is useful. A few comments:

  1. We should avoid changing the line text, as otherwise a user cannot get back to what they have written. Instead, once a user hits the down arrow and enters the menu, the only thing that should change is what is highlighted.
  2. I'm still of the opinion that when there are exact prefix matches, you don't want fuzzy tab completion. E.g., I often use TAB completion to remind me the destination of what I want to type next, not in order to perform a quasi-search. For explicit search, I think it makes sense to have a dedicated API.

kgryte avatar Apr 05 '24 06:04 kgryte

@Snehil-Shah The ability to arrow through fuzzy match options is useful. A few comments:

  1. We should avoid changing the line text, as otherwise a user cannot get back to what they have written. Instead, once a user hits the down arrow and enters the menu, the only thing that should change is what is highlighted.
  2. I'm still of the opinion that when there are exact prefix matches, you don't want fuzzy tab completion. E.g., I often use TAB completion to remind me the destination of what I want to type next, not in order to perform a quasi-search. For search explicitly, I think it makes sense to have a dedicated API.

@kgryte with the current implementation, the user can get back to the previous line by instinctively clicking the up arrow beyond the first row. I feel this is ergonomic as the user can quickly navigate and then continue typing after the completion is inserted, and if they don't want the completion they can go back up to the line

A dedicated search API is definitely useful, but I feel the user still would want fuzzy completions being displayed if they don't exactly remember what they are searching for. Example: for a local variable name, it's unlikely the user would want to use a search API just to search a variable they defined above. If it looks messy, maybe we can add a space to separate exact matches and fuzzy matches? What do you think? Also with current implementation, the exact prefixes are shown first and in lexicographic order, and then the fuzzy completions are shown in relevant order (score).

Snehil-Shah avatar Apr 05 '24 06:04 Snehil-Shah

Feel free to try separating by a line and we can discuss whether it improves UX.

I recommend going ahead and reconciling the merge conflicts; otherwise, you're likely to make your life more painful later on.

Fuzzy completions raise a question as to where do you stop with regard to namespaces? E.g., if I type abs, we could reasonably return abs, iterAbs, etc from the current REPL context, but we could also search namespaces (e.g., base.abs). On the one hand, maybe I type abs and I want to know where all the "abs" functionality is. But that would increase complexity AND it begs where to stop. E.g., if I type toStr, do we then walk context variables to check for toString properties? I'm not sure.

Still not clear to me that TAB completion is actually the best use case for fuzzy match.

kgryte avatar Apr 05 '24 07:04 kgryte

@kgryte searching namespaces is not ideal for tab completions, because then it's trying too hard to be a search than a completer. Taking a step back, I think the purpose of a tab completer should be guessing what the user is trying to type, if the user has already typed base.a, then yes we should search that namespace like we already do. And as the user can make mistakes in typing or if the user doesn't exactly remember what to type, fuzzy completions play a good role there by guessing what they're trying to type. If they typed base.abd, the completer should know what they're trying to say.

Snehil-Shah avatar Apr 05 '24 08:04 Snehil-Shah

@kgryte Done with the implementation. Fuzzy completions is now controllable via settings, and also added a max height to the completions output, as, in cases where the output was taller than the terminal (causing scrolling), moveCursor wasn't working as expected.

Current behaviours of TAB completions:

  • going up to the line, brings back the original line and going to any completion automatically updates the line.

https://github.com/stdlib-js/stdlib/assets/130062020/83163553-414b-4542-bacc-4905f435c063

  • Hitting TAB is like a toggle button to bring up and hide the completions panel.

  • I tried separating the exact completions from the fuzzy ones and IMO they look worse:

image double

It was better before, and now that we have a setting to disable it, in my opinion, we can normally display them just after the exact completions as it doesn't hurt to have more completions that the user might be potentially going for, given the exact ones are up top already. And if the user doesn't want it, they can disable it via settings. But if you have a differing opinion, we can stick to your recommendation of using exact prefixes by default and when nothing matches, showing fuzzy completions.👍

Snehil-Shah avatar Apr 05 '24 16:04 Snehil-Shah

Re: line separation. I don't think I agree here. By mixing the exact match and fuzzy results, you are making the user search for the exact matches. They, themselves, have to identify the boundaries between fuzzy and exact match. And that is not ideal UX, especially as the fuzzy matches create visual noise.

Take your example above. For the exact matches, all matching text is lower case, but, for the fuzzy matches, you have a mix of lowercase, camelcase, and title case. The fuzzy completions are just inherently noisier, as there is no pattern. And because of that noise, you are imposing additional cognitive load on users to identify the "correct" match and to find how their query matches any one specific fuzzy match. In short, you are making users work. And users, generally, try to avoid work.

The line separation makes it clear what is exact and what is fuzzy. And a line separation provides structure to results for which, displayed together, lack obvious structure, until you spend time understanding order and implicit grouping. IMO, grouping together is worse UX.

On another note, I am wondering if we should limit the number of results in order to avoid choice paralysis. Did a quick search (ref 1 and ref 2), and seems generally agreed to limit fuzzy results to no more than 10 items. Anything more and users spend more time searching than auto-completing.

kgryte avatar Apr 05 '24 19:04 kgryte

@kgryte I understand. Regarding limiting the results, are we talking about just limiting the fuzzy results or all completions in general? Because if we limit even exact prefixes, then we are losing potential completions that the user is actually trying to find as the exact prefixes are sorted lexicographically and not based on some order of relevancy. in short, on what basis will we be discarding exact prefixes by limiting overall completions to 10? Limiting fuzzy results does make sense though as we have a basis for discarding completions (ie. score). And what are your thoughts on showing fuzzy matches only when there are no exact prefixes? It would lead to lesser and cleaner completions and better perf..

Snehil-Shah avatar Apr 05 '24 20:04 Snehil-Shah

Agreed. Restricting number of completion candidates only makes sense for fuzzy completions.

And yeah, I still lean toward fuzzy completions only when no exact prefix matches.

kgryte avatar Apr 05 '24 20:04 kgryte

Also, fuzzy completions only when no exact match is the less invasive choice to start. If users start clamoring for fuzzy completions even when there are exact prefix matches, should be straightforward to refactor to match the current behavior. And we have this PR as a reference.

kgryte avatar Apr 05 '24 20:04 kgryte

@kgryte great. Will go ahead with fuzzy completions only when there's no exact matches. And now that we are doing this, does it make sense to limit results? When we are displaying fuzzy completions, it means there were no exact matches. Presumably, that means the fuzzy completions too wouldn't be that much. Plus as there are no exact completions, there's a chance the user really needs help there, and limiting fuzzy completions would be counter intuitive imo. What do you think?

Snehil-Shah avatar Apr 05 '24 20:04 Snehil-Shah

Maybe make the number of displayed fuzzy completions a setting?

kgryte avatar Apr 05 '24 21:04 kgryte

@kgryte we can do that. although I wonder if it's that user-friendly given it's generally abstracted to just a true or false in most editors🤔

Snehil-Shah avatar Apr 05 '24 21:04 Snehil-Shah

Another option would be to provide a setting (or settings) for users to adjust the magic parameters which are used to score results. That would achieve the same end as limiting the number of results. But, then again, we could change the underlying scoring algo in the future.

Regardless, I am not sure that users benefit all that much from having more than 10 fuzzy choices. One option to make it more "powerful" or relevant is to weight items in a user's history higher. But this makes the implementation more complex.

I am tempted to just say limit to 10. If people want more, we can easily make this adjustable.

kgryte avatar Apr 05 '24 21:04 kgryte