atuin Improper prioritization using fuzzy search

Improper prioritization using fuzzy search

Open jemag opened this issue 1 year ago • 11 comments

It seems that the fuzzy search is improperly prioritizing despite changes made in https://github.com/ellie/atuin/pull/179 For example: The following interactive query tg plan will result in: Despite the fact that there are results for tg plan -out that should be of much higher priority: In fact, the tg plan command does not even appear anywhere within the list of 50+ items

atuin version : 12.0.0

Jan 28 '23 16:01 jemag

After a bit more testing, it seems that the results of atuin search "tg plan" are actually different from the ones in interactive mode (not sure why that is the case) and seem a bit better: Although there is still weird fuzzy searching behavior there.

For example, why would k get ing publicapi-ingress be ranked higher than tg apply aks-tfplan? The latter contains the full sequential tg and plan.

Same thing for npm list -g | grep -i lang. I don't see why it would rank above tg apply kv-tfplan which once again contains both tg and plan

Jan 28 '23 17:01 jemag

because the minimum span of the first one is shorter than the second, and there is no prioritisation applied for contiguous letters.

ghci> length "t ing publicapi-in"
18
ghci> length "tg apply aks-tfplan"
19

If you like a different algorithm you could certainly write one.

Jan 31 '23 03:01 mwotton

Thank you for the clarification on the current algorithm. I guess this is subjective, but I do think most users would expect the 2nd choice to be chosen in that situation (despite one being shorter).:

ghci> length "t ing publicapi-in"
18
ghci> length "tg apply aks-tfplan"
19

Would it not be possible to re-use fzf, skim or others instead of atuin trying to write its own fuzzy algorithm from scratch?

Also, in regards to the different results between the search command and interactive mode shown in the original post, is that expected?

Jan 31 '23 03:01 jemag

It's a fair criticism. I think there will always be complaints that the fuzzy search doesn't get exactly what you want.

I don't mind swapping out the fuzzy implementation, but first and foremost it needs to be fast. Reading 100000+ history entries from the database and shelling them out to fzf will be too slow for us.

I'm considering putting together a demo of skim. But I'm more interested in switching to tantivy for a full-text search engine.

Feb 06 '23 13:02 conradludgate

Ok, on my 40000 entries, reading the entire database and removing duplicates takes 140ms. Each query of those 40000 entries takes 15ms using skim. It's not awful. you certainly can run the query and render it in a couple frames.

Feb 06 '23 16:02 conradludgate

@jemag if you're willing, can you try #695 and report if you find the results subjectively better for your tastes? Is there any noticeable change in performance?

Feb 06 '23 17:02 conradludgate

Sure, and thanks for going through this experimentation.

I am a bit busy at the moment, but will definitely find some time this week to try it.

Feb 06 '23 17:02 jemag

PR works quite well for me. I do feel like the results match more closely to traditional fuzzy finders (and what many users may have come to expect over the years). In terms of interactive performance, I really could not notice it with my ~36k entries, although if I time a non interactive search it does show up as slightly slower.

In both situations (PR or not) the results from an interactive and a non-interactive search are slightly different, is there a particular reason for this?

Feb 12 '23 17:02 jemag

Most of the time I want to do an exact match, so in the past fzf with the --exact option worked perfectly for me. This option inverts the meaning of the ' prefix, so that by default it does an exact match unless you use the ' prefix. I do want to be able to match words in an arbitrary order though to narrow down the results, so the fulltext option doesn't work for me.

I was actually tempted to investigate adding this kind of option to atuin, but if there's potentially going to be a very different matcher with tantivy in the future then I'm not sure if it makes sense to spend time on that now.

I also gave the "skim" option a try, but with my ~360k entries the initial search was noticeably slower, and for some reason I can't get the ' prefix to work with it even though its documentation says that it is supported.

Apr 09 '23 01:04 majutsushi

like @majutsushi I normally prefer exact matches in arbitrary order, I rarely do typos so my need for fuzzy search is not very high. And just today I had a case where I have two servers, "abc.domain.com" and "abctest.domain.com", when I typed "ssh abc.domain.com" in the search, the best match was still "ssh abctest.domain.com", which just feels weird since it was an exact match. The line I typed was still the second match, so it wasn't far away, but still strange.

So, for me, I would prefer to see exact matches ranked much higher than fuzzy matches, or modify fulltext to treat space like AND to get exact matches in arbitrary orders... that would also be more in line with what I expect from a fulltext search.

May 09 '23 06:05 snaggen

A prefer_exact_match config flag that reverses the behavior of ' would be great!

Oct 04 '23 15:10 hashworks

atuin atuin copied to clipboard

Improper prioritization using fuzzy search

atuin
atuin copied to clipboard