uFuzzy
maybe detect acronyms?
an option to detect acronyms in the needle might be interesting, but also tricky
Teenage Mutant Ninja Turtles, commonly abbreviated as TMNT, is an American media franchise created by the comic book artists Kevin Eastman and Peter Laird.
searching for TMNT would modify the term to "t m n t" and maybe set interLft: 2. not sure this can actually work, e.g. NASA and NBA are never actually spelled out. plus interLft: 2 affects the whole needle, so it would have unwanted side effects. it's always possible to do better discarding for acronyms after the initial filter, or maybe not...
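To make the needle rewrite above concrete, here is a minimal sketch of the pre-processing step. `expandAcronym` is a hypothetical helper and not part of uFuzzy; the options object only mirrors the `interLft: 2` setting mentioned in this thread and is not passed to any library here.

```typescript
// Hypothetical helper (not a uFuzzy API): turn a suspected acronym into a
// space-separated needle so each letter becomes its own search term.
function expandAcronym(needle: string): string {
  return needle
    .trim()
    .toLowerCase()
    .split("")
    .filter((ch) => /[a-z0-9]/.test(ch)) // drop dots, spaces, punctuation
    .join(" ");
}

// Options as discussed in the thread. Note that this setting would apply to
// every term in the needle, which is the side effect raised above.
const acronymOpts = { interLft: 2 };

const expandedNeedle = expandAcronym("TMNT"); // "t m n t"
```

This only covers the rewrite itself. Deciding whether a given needle is an acronym at all is a separate problem and is left open here.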
farzher/fuzzysort handles acronyms pretty well right?
I think after modifying to "t m n t" you could add a sort function that prioritizes prefix matches?
the main problem here is knowing if you have an acronym or not. once we know that, we can easily set the correct options:
https://leeoniya.github.io/uFuzzy/demos/compare.html?libs=uFuzzy&search=t%20m%20n%20t&interLft=2
how do we know that lowercase "tmnt" is an acronym but "fast" is not? you would not want "fast" to match "for a strange test"
I see. I think it probably goes more into search relevance than fuzzy search can handle. But I believe there are some rules we can use to get close to good results.
Exact full term matches:
- Self-explanatory.
- Example: for haystack ["fast", "for a strange test"], searching fast should return "fast" first because it is an exact match.
Exact acronym matches:
- I think this is what we're really interested in.
- Generally we search for acronyms by providing the entire acronym as needle.
- Example: for haystack ["faster", "for a strange test"], searching fast should return "for a strange test" first because it is an exact match against the first character of each "word" within "for a strange test".
Partial acronym matches:
- There are special considerations where things get tough.
- Easy example: we don't match tmn against "Teenage Mutant Ninja Turtles" because it is a typo.
- Hard example: for haystack ["Code Vein", "Call of Duty: Black Ops"], what should be the best result of searching cod? Based on previous rules, it should be "Code Vein". It is reasonable to expect that the user will modify the search term to codbo if they actually wanted to get "Call of Duty: Black Ops".
- Another example: what about "Teenage Mutant Ninja Turtles: Mutants in Manhattan"? Let's say that the popular acronym for it is tmnt mm. In this case, it'll no longer be an exact acronym match. But an observation is that the longer the acronym, the lower the chance that it forms an actual word that collides with our desired expanded acronym. So it should still be ok following the previous rules. Probably.
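The rules above can be sketched as a small tiering function. This is only an illustration of the proposal under stated assumptions, not how uFuzzy or fuzzysort rank results: the names `initials`, `tier`, `rank`, and `NO_MATCH` are made up for this example, "words" are split on any non-alphanumeric character, and a plain prefix match is placed below an exact acronym match, which is itself a preference.

```typescript
// Hypothetical tiers: 0 = exact full term, 1 = exact acronym, 2 = prefix.
const NO_MATCH = -1;

// First character of each "word", lowercased:
// "for a strange test" -> "fast"
function initials(text: string): string {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((w) => w.length > 0)
    .map((w) => w.charAt(0))
    .join("");
}

function tier(needle: string, candidate: string): number {
  const n = needle.trim().toLowerCase();
  const c = candidate.trim().toLowerCase();
  if (n.length === 0) return NO_MATCH;
  if (c === n) return 0;           // exact full term match
  if (initials(c) === n) return 1; // exact acronym match
  if (c.startsWith(n)) return 2;   // plain prefix match
  return NO_MATCH;                 // partial acronyms such as "tmn" are rejected
}

// Keep matching items, best tier first; ties keep haystack order.
function rank(needle: string, haystack: string[]): string[] {
  return haystack
    .map((item, i) => ({ item, i, t: tier(needle, item) }))
    .filter((r) => r.t !== NO_MATCH)
    .sort((a, b) => a.t - b.t || a.i - b.i)
    .map((r) => r.item);
}
```

With these assumptions, `rank("fast", ["faster", "for a strange test"])` puts "for a strange test" first, `rank("cod", ["Code Vein", "Call of Duty: Black Ops"])` returns only "Code Vein", and searching codbo returns only "Call of Duty: Black Ops". A needle like tmnt mm is not handled, since it is not an exact acronym under this splitting rule.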
I'm not sure this belongs in the core, honestly. you can simply pre-process the needle and create a few different needles + uFuzzy options, and just do several independent searches, then combine and sort the results as you see fit. it will be slightly slower but that's an okay trade-off for keeping the internals relatively unopinionated and straightforward.
- Example: for haystack ["faster", "for a strange test"], searching fast should return "for a strange test" first because it is an exact match against the first character of each "word" within "for a strange test".
that's just your preference. many others (myself included) would expect the prefix match first. it's not black and white, unfortunately.
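The "several independent searches, then combine" approach suggested above could look roughly like the sketch below. Everything here is an assumption made for illustration: `Variant`, `SearchFn`, `buildVariants`, `searchAll`, and `toySearch` are hypothetical names, and `toySearch` is a deliberately simple stand-in for a real fuzzy searcher, not uFuzzy's behavior. The `acronymFirst` flag exists because, as noted, which result should come first is a preference.

```typescript
// One pre-processed form of the user's needle.
interface Variant {
  needle: string;
  opts: Record<string, number>; // e.g. { interLft: 2 }, passed through as-is
  priority: number;             // lower sorts first
}

// Any backend that returns the haystack indices matching a needle.
type SearchFn = (
  haystack: string[],
  needle: string,
  opts: Record<string, number>
) => number[];

// Build a plain needle and a spaced-out acronym needle.
function buildVariants(needle: string, acronymFirst: boolean): Variant[] {
  const spaced = needle
    .toLowerCase()
    .split("")
    .filter((ch) => ch !== " ")
    .join(" ");
  return [
    { needle, opts: {}, priority: acronymFirst ? 1 : 0 },
    { needle: spaced, opts: { interLft: 2 }, priority: acronymFirst ? 0 : 1 },
  ];
}

// Run every variant independently, keep the best priority per item, then sort.
function searchAll(
  haystack: string[],
  variants: Variant[],
  search: SearchFn
): number[] {
  const best = new Map<number, number>(); // haystack index -> best priority
  for (const v of variants) {
    for (const idx of search(haystack, v.needle, v.opts)) {
      const prev = best.get(idx);
      if (prev === undefined || v.priority < prev) best.set(idx, v.priority);
    }
  }
  return Array.from(best.entries())
    .sort((a, b) => a[1] - b[1] || a[0] - b[0])
    .map((entry) => entry[0]);
}

// Toy stand-in: with interLft 2, each term must start the word at the same
// position and the word count must match; otherwise a plain substring test.
const toySearch: SearchFn = (haystack, needle, opts) => {
  const lower = needle.toLowerCase();
  const terms = lower.split(" ").filter((t) => t.length > 0);
  const out: number[] = [];
  haystack.forEach((item, idx) => {
    const text = item.toLowerCase();
    const words = text.split(/[^a-z0-9]+/).filter((w) => w.length > 0);
    const hit =
      opts.interLft === 2
        ? terms.length > 0 &&
          terms.length === words.length &&
          terms.every((t, i) => (words[i] ?? "").startsWith(t))
        : terms.length > 0 && text.includes(lower);
    if (hit) out.push(idx);
  });
  return out;
};
```

For the haystack ["faster", "for a strange test"] and needle fast, `searchAll` with `buildVariants("fast", true)` orders the acronym hit before the prefix hit, and with `false` it does the opposite, so both preferences from this thread fit the same structure without changing the search backend.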