maybe detect acronyms?

Open leeoniya opened this issue 1 year ago • 4 comments

an option to detect acronyms in the needle might be interesting, but also tricky

Teenage Mutant Ninja Turtles, commonly abbreviated as TMNT, is an American media franchise created by the comic book artists Kevin Eastman and Peter Laird.

searching for TMNT would rewrite the needle to t m n t and maybe set interLft: 2. not sure this can actually work, e.g. NASA and NBA are never actually spelled out. plus interLft: 2 affects the whole needle, so it would have unwanted side-effects on any non-acronym terms. it might be possible to do stricter discarding for acronyms after the initial filter, or maybe not...
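A minimal sketch of the needle rewrite described above, assuming the needle is already known to be an acronym. `expandAcronymNeedle` is a hypothetical helper, not part of uFuzzy; only the `interLft: 2` setting comes from the discussion.

```javascript
// Hypothetical helper (not part of uFuzzy): split a compact acronym needle
// into single-letter terms, e.g. "TMNT" -> "t m n t".
function expandAcronymNeedle(needle) {
  return needle
    .trim()
    .toLowerCase()
    .split("")
    .filter((ch) => /[a-z0-9]/.test(ch)) // drop dots, spaces, other punctuation
    .join(" ");
}

// Options that would accompany the rewritten needle, per the idea above.
// Note that they apply to every term in the needle, not only the acronym.
const acronymOpts = { interLft: 2 };

console.log(expandAcronymNeedle("TMNT")); // "t m n t"
```

This only covers the rewrite itself. It does nothing about haystacks where the acronym is never spelled out, which is the harder part of the problem.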

leeoniya avatar Oct 02 '23 14:10 leeoniya

farzher/fuzzysort handles acronyms pretty well right?

I think after modifying the needle to t m n t you could add a sort function that prioritizes prefix matches?

theBowja avatar Feb 09 '24 04:02 theBowja

the main problem here is knowing if you have an acronym or not. once we know that, we can easily set the correct options:

https://leeoniya.github.io/uFuzzy/demos/compare.html?libs=uFuzzy&search=t%20m%20n%20t&interLft=2

how do we know that lowercase tmnt is an acronym but fast is not? you would not want "fast" to match "for a strange test"

leeoniya avatar Feb 09 '24 17:02 leeoniya

I see. I think it probably goes more into search relevance than fuzzy search can handle. But I believe there are some rules we can use to get close to good results.

Exact full term matches:

  • Self-explanatory.
  • Example: for haystack ["fast", "for a strange test"], searching fast should return "fast" first because it is an exact match.

Exact acronym matches:

  • I think this is what we're really interested in.
  • Generally we search for acronyms by providing the entire acronym as needle.
  • Example: for haystack ["faster", "for a strange test"], searching fast should return "for a strange test" first because it is an exact match against the first character of each "word" within "for a strange test".

Partial acronym matches:

  • There are special considerations where things get tough.
  • Easy example: we don't match tmn against "Teenage Mutant Ninja Turtles" because an incomplete acronym like this is most likely a typo.
  • Hard example: for haystack ["Code Vein", "Call of Duty: Black Ops"], what should be the best result of searching cod? Based on previous rules, it should be "Code Vein". It is reasonable to expect that the user will modify the search term to codbo if they actually wanted to get "Call of Duty: Black Ops".
  • Another example: what about "Teenage Mutant Ninja Turtles: Mutants in Manhattan"? Let's say that the popular acronym for it is tmnt mm. In this case, it'll no longer be an exact acronym match. But an observation is that the longer the acronym, the lower the chance that it forms an actual word that collides with our desired expanded acronym. So it should still be ok following the previous rules. Probably.
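The rules above can be sketched as a tiered ranking. This is an illustration under assumptions, not uFuzzy behavior: `initials`, `tier`, and `rank` are hypothetical names, words are split on any non-alphanumeric character, and the tier order (exact term, then exact acronym, then prefix) simply follows the proposal as written.

```javascript
// Hypothetical helpers illustrating the proposed tiers; not uFuzzy API.

// First character of each word, lowercased: "Call of Duty" -> "cod".
function initials(text) {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter(Boolean)
    .map((word) => word[0])
    .join("");
}

// Lower tier number = ranked earlier. Tier 3 means "no match" here.
function tier(needle, item) {
  const term = needle.trim().toLowerCase();
  const compact = term.replace(/\s+/g, ""); // "tmnt mm" -> "tmntmm"
  const text = item.toLowerCase();
  if (text === term) return 0;              // exact full term
  if (initials(item) === compact) return 1; // exact acronym
  if (text.startsWith(term)) return 2;      // plain prefix match
  return 3;
}

function rank(needle, haystack) {
  return haystack
    .map((item, idx) => ({ item, idx, tier: tier(needle, item) }))
    .filter((r) => r.tier < 3)
    .sort((a, b) => a.tier - b.tier || a.idx - b.idx) // keep input order within a tier
    .map((r) => r.item);
}

console.log(rank("fast", ["faster", "for a strange test"]));
console.log(rank("cod", ["Code Vein", "Call of Duty: Black Ops"]));
console.log(rank("codbo", ["Code Vein", "Call of Duty: Black Ops"]));
```

With this sketch, tmn does not match "Teenage Mutant Ninja Turtles", and tmnt mm does not match the "Mutants in Manhattan" title either, because the initials there are tmntmim. Handling that case would need a looser acronym rule than exact equality.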

theBowja avatar Feb 09 '24 18:02 theBowja

i'm not sure this belongs in the core, honestly. you can simply pre-process the needle and create a few different needles + uFuzzy options, and just do several independent searches, then combine and sort the results as you see fit. it will be slightly slower but that's an okay trade-off for keeping the internals relatively unopinionated and straightforward.
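A rough sketch of that approach, assuming each variant supplies its own needle and search function. `multiSearch`, `priority`, and the two toy search functions are hypothetical stand-ins so the example runs without any library; in practice each `searchFn` would presumably wrap a uFuzzy instance built with its own options.

```javascript
// Run several (needle, searchFn) variants and merge the results.
// searchFn(haystack, needle) is assumed to return matching haystack indices.
function multiSearch(haystack, variants) {
  const best = new Map(); // haystack index -> lowest priority that matched it
  for (const { needle, searchFn, priority } of variants) {
    for (const idx of searchFn(haystack, needle)) {
      if (!best.has(idx) || priority < best.get(idx)) best.set(idx, priority);
    }
  }
  return [...best.entries()]
    .sort((a, b) => a[1] - b[1] || a[0] - b[0]) // by priority, then input order
    .map(([idx]) => haystack[idx]);
}

// Toy stand-in for a regular search: case-insensitive substring match.
const substringSearch = (haystack, needle) =>
  haystack.flatMap((s, i) => (s.toLowerCase().includes(needle) ? [i] : []));

// Toy stand-in for an acronym search: one needle letter per word start.
const wordStartSearch = (haystack, needle) => {
  const letters = needle.split(" ");
  return haystack.flatMap((s, i) => {
    const words = s.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
    const ok =
      letters.length === words.length &&
      letters.every((l, k) => words[k].startsWith(l));
    return ok ? [i] : [];
  });
};

const haystack = ["faster", "for a strange test", "breakfast"];
const results = multiSearch(haystack, [
  { needle: "fast", searchFn: substringSearch, priority: 0 },
  { needle: "f a s t", searchFn: wordStartSearch, priority: 1 },
]);

console.log(results); // ["faster", "breakfast", "for a strange test"]
```

The `priority` values are the caller's choice, so swapping them puts acronym matches ahead of regular matches without touching the search itself.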

  • Example: for haystack ["faster", "for a strange test"], searching fast should return "for a strange test" first because it is an exact match against the first character of each "word" within "for a strange test".

that's just your preference. many others (myself included) would expect the prefix match first. it's not black and white, unfortunately.

leeoniya avatar Feb 09 '24 18:02 leeoniya