zoxide Score directories based on the keywords searched

Use word boundary detection and give scores based on keyword matching

I'm using a helper library to implement a unicode algorithm, but I'm also detecting case changes within a word (from lower to upper case, or no case to some case, so the "o" in "Documents" doesn't count as a new word).

Words are not searched--rather, the string is searched starting at a word boundary. That way multi-word sequences will correctly match.

This is a basic solution to #260. There's lots of room for improving the algorithm, but IMO this is a big UX improvement already.

Some things to consider:

We don't have options to control the case. If smart-case is disabled, the weights in compute_kw_score need to change.
Right now keyword score totally overpowers the frequency score. The frequency score is only a tie-breaker. They could be normalized and weighted so a much better frequency score would win despite a slightly worse keyword score.
Should we detect word endings for exact matches? I'm not sure it would give a good user experience. If I frequently access "src9" but not "src", I don't want "src" to win just because it more exactly matches what I typed. It's hard to refrain from typing a whole word. This gives an interesting wrong result with "c" being an ideal match for "/mnt/c/anything".
The above issue can be mitigated but not solved if we consider digits to start a new word.
I'm testing these changes with a command like cargo r -- query --list --score b.

Sep 17 '21 14:09 lefth

@ajeetdsouza Do you know if some people have really massive directory databases? I optimized this more than might strictly be necessary if it's unlikely anyone has more than a couple hundred directories.

Also related to that: if you don't like the extra complexity of adding a new iterator class and delaying sorting, I can do away with that by sorting in the constructor as before, either by sorting again when the parameters and excludes change, or by removing methods like with_keywords and passing that data to the constructor instead.

Sep 25 '21 05:09 lefth

I optimized this more than might strictly be necessary

Optimization would be great, but we should focus on finalizing the algorithm first. I haven't had time to look at this PR yet, but hopefully I'll be able to check it out soon.

On a related note, I tried searching for how similar tools handle a z foo command with directories /foo and /foobar, since that seems to be one of the primary motives of this change.

autojump The solution given was to manually lower the score of the /foobar directory, or re-run the z command since the current directory is excluded (https://github.com/wting/autojump/issues/533).

z The algorithm was changed to always select the shortest match if the prefix is common among all results. The author comments that the solution may have been suboptimal (https://github.com/rupa/z/issues/35). Someone else suggested going to an exact match if there is a query like z foo/, which sounds doable (https://github.com/rupa/z/issues/303).

Since we're already analyzing the querying algorithm, here's another suggestion from https://github.com/ajeetdsouza/zoxide/issues/263: for a user named foo, z foo bar should probably prefer /home/foo/dir/foo/bar rather than /home/foo/bar. It's unlikely that the user would actually mention their username when searching for a path.

Sep 26 '21 23:09 ajeetdsouza

@ajeetdsouza What would it take for you to accept this pull request? Simpler code, smaller changes?

I suggest this change is a key step toward improving the algorithm--improvement can take a faster pace after the search string influences the score, but after that step a lot of tweaking is possible.

Nov 26 '21 05:11 lefth

BTW, this will obviously still need tweaking after it's released. There's no way to make it perfect at the first go without a lot of beta testers. To tweak, I'm thinking we could add to add this to the help string:

If a query takes you to a directory it shouldn't (if the smart logic for picking a directory isn't smart enough), please comment on <some issue URL> with this query and the first few lines of output, so we can see directory scores and judge if the matcher needs tweaking: zoxide --version && zoxide query -ls <the query>

Nov 27 '21 03:11 lefth

Hey @lefth, sorry for the delay, I've been having a very busy month! I took a look at the changes, I put in my thoughts below.

I don't think there's any need to change the DirDisplayScore implementation. The purpose of including score in the output is so that users can implement custom functions in their shell configurations -- essentially, zoxide query -ls | sort should sort results by score. Thus, we should ideally change the score function to return the frecency combined with the match score.

This gives an interesting wrong result with c being an ideal match for /mnt/c/anything.

I think that requiring the last component of the path to match the last component of the query works really well, and should not be removed. If you really want to match the highest ranked subdirectory under foo, you can always use a query like z foo /.

Part of this discussion was inspired by the fact that there is no way in zoxide (or in fact, any other autojumper) to cd into /foo if there was a directory called /foobar that was more highly ranked. Are we tackling this problem in any way?

The one way I can think of is to improve the interactive search feature to the point that it becomes a seamless part of one's workflow. This is why I tried to prioritize https://github.com/ajeetdsouza/zoxide/pull/257 -- I wanted a user to simply press <SPACE><TAB> and get to interactive selection. This is now available on bash and fish, and I'm bringing it to zsh once I get the time.

I've mentioned this elsewhere, but one of the things I really like about zoxide is the predictability of the algorithm -- even when it's wrong, it's usually not hard to understand why and adapt the query accordingly. The more complex the algorithm becomes, the harder it becomes for people to understand why zoxide picked the directory it did, and the more frustrating it gets when zoxide jumps to the wrong directory.

Ideally, I'd think a good query algorithm would work the same way the current algorithm does, but would assign some weight (in descending order) to-

exact matches of components (split by /; example: component is foo and query is foo)
left-side matches of components (example: component is foobar and query is foo)
exact matches of words (split by unicode word separators)
left-hand matches of words
regular matches (the way they are now)

Still more ambitious is smartcase and unicode normalization. After this, we could sum up the weights and find the one with the highest score. I'm not sure how this would be implemented, but what do you think of the general idea?

Nov 28 '21 19:11 ajeetdsouza

Closing due to inactivity, feel free to open an issue to discuss this further.

Sep 06 '22 07:09 ajeetdsouza