fuzz-aldrin-plus
Weak character matches given more priority than better acronym match
Hi @jeancroy, here is another scenario where I think acronym score is weak:
candidates:
sft/Tests/Plugins/GulpImportStepPerformer.js
sft/Tests/Plugins/Cloud/GulpImportStepPerformer.js
sft/Tests/tfat/Reporting/Tools/teams/dev50/sftRobotic/GssHelper.js
search query:
sft/gisp.js
results:
sft/Tests/tfat/Reporting/Tools/teams/dev50/sftRobotic/GssHelper.js
sft/Tests/Plugins/GulpImportStepPerformer.js
sft/Tests/Plugins/Cloud/GulpImportStepPerformer.js
expected:
sft/Tests/Plugins/GulpImportStepPerformer.js
sft/Tests/Plugins/Cloud/GulpImportStepPerformer.js
sft/Tests/tfat/Reporting/Tools/teams/dev50/sftRobotic/GssHelper.js
GssHelper.js doesn't even contain gisp in order, but it is still scored higher than the better acronym matches.
Hi, if you look at the code, there's the concept of an "acronym prefix". Basically it's a heuristic that tries to decide whether you are looking for an acronym or for consecutive letters.
And that heuristic requires you to put the acronym at the start of the query. Why the start? Because that early there's no backtracking. (The cleanest way out of this, I believe, is multi-word support. That would mean segmenting the query into words and restarting the acronym prefix search on each word; however, without backtracking there's no guarantee those prefixes won't overlap.)
So in the query sft/gisp.js, the sft/ part is working against you.
Moreover, the path separator character (/ or \ depending on environment) is special: sft/gisp.js is really interpreted as "sft should be found in the immediate parent folder of gisp.js", and that's another reason sftRobotic ranks so well. I'll agree it's not documented or universally understood that way, so maybe I'll disable the behavior or make it optional.
For now I'll suggest using something like sgisp.js or stpgisp.js. Perhaps another way to bias the match toward acronyms is the uppercase-means-uppercase rule, where lowercase can still match either case. It's not implemented, but other fuzzy libraries are successful with it.
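To make the uppercase-means-uppercase idea concrete, here is a minimal sketch. It is not fuzzaldrin-plus code (the rule is not implemented there); `charMatches` and `isSubsequence` are hypothetical names, and the matcher only checks in-order containment without scoring anything.

```javascript
// Hypothetical sketch, not part of fuzzaldrin-plus.
// An uppercase query character must match an uppercase subject character
// exactly; a lowercase query character matches either case.
function charMatches(queryChar, subjectChar) {
  const queryIsUpper = queryChar !== queryChar.toLowerCase();
  if (queryIsUpper) return queryChar === subjectChar;
  return queryChar === subjectChar.toLowerCase();
}

// Greedy in-order containment test built on the rule above.
function isSubsequence(query, subject) {
  let qi = 0;
  for (let si = 0; si < subject.length && qi < query.length; si++) {
    if (charMatches(query[qi], subject[si])) qi++;
  }
  return qi === query.length;
}

const good = "sft/Tests/Plugins/GulpImportStepPerformer.js";
const bad = "sft/Tests/tfat/Reporting/Tools/teams/dev50/sftRobotic/GssHelper.js";

console.log(isSubsequence("gisp", bad));  // true: scattered lowercase letters match
console.log(isSubsequence("GISP", bad));  // false: there is no uppercase I after G
console.log(isSubsequence("GISP", good)); // true: Gulp, Import, Step, Performer
```

Under this rule, typing the acronym in uppercase would rule out the GssHelper.js path before any scoring happens.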
Aha! I see it now, thanks @jeancroy for explaining it in detail.
I always thought a scoped acronym query like sft/gisp.js is better than gisp.js, but it turned out to be the other way round.
The main concern is that GssHelper.js is scored highest even though it in no way relates to the query gisp, and it gets worse with highlighting, which feels like a bug.
I understand that we don't want to backtrack so early, especially to score acronyms, but how about an LCS between the candidate's StartOfWord characters and the query? Something of this sort:
sft/Tests/Plugins/GulpImportStepPerformer.js -> s/T/P/GISP
sft/Tests/tfat/Reporting/Tools/teams/dev50/sftRobotic/GssHelper.js -> s/T/t/R/T/t/d/s/GH
With these, if we try LCS for the query sft/gisp.js, the first will be scored higher. Currently, the problem is that we have a scoreAcronym routine that gives out ZERO for all three candidates in this case. It exhausts the query even before it hits the actual acronyms.
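A rough sketch of the LCS-on-acronym-space proposal, to check that it separates the candidates. Everything here is an assumption for illustration: `isWordStart`, `acronymForm` and `lcsLength` are hypothetical helpers, not fuzzaldrin-plus functions, and the word-start rule is a simplified one (first character, character after a non-alphanumeric, or an uppercase letter after a lowercase one). It also keeps the extension's first letter and the R of sftRobotic, so the reduced forms differ slightly from the hand-written ones above.

```javascript
// Hypothetical sketch, not part of fuzzaldrin-plus.
function isWordStart(subject, i) {
  const curr = subject[i];
  if (!/[A-Za-z0-9]/.test(curr)) return false;
  if (i === 0) return true;
  const prev = subject[i - 1];
  if (!/[A-Za-z0-9]/.test(prev)) return true;        // after a separator
  return /[a-z]/.test(prev) && /[A-Z]/.test(curr);   // camelCase boundary
}

// Keep only start-of-word characters and path separators, lowercased.
function acronymForm(subject) {
  let out = "";
  for (let i = 0; i < subject.length; i++) {
    const c = subject[i];
    if (c === "/" || c === "\\" || isWordStart(subject, i)) out += c.toLowerCase();
  }
  return out;
}

// Classic longest-common-subsequence length, using one rolling row.
function lcsLength(a, b) {
  const row = new Array(b.length + 1).fill(0);
  for (let i = 1; i <= a.length; i++) {
    let diag = 0; // dp[i-1][j-1]
    for (let j = 1; j <= b.length; j++) {
      const up = row[j]; // dp[i-1][j]
      row[j] = a[i - 1] === b[j - 1] ? diag + 1 : Math.max(row[j], row[j - 1]);
      diag = up;
    }
  }
  return row[b.length];
}

const query = "sft/gisp.js";
const good = acronymForm("sft/Tests/Plugins/GulpImportStepPerformer.js");
const bad = acronymForm("sft/Tests/tfat/Reporting/Tools/teams/dev50/sftRobotic/GssHelper.js");

console.log(good, lcsLength(query, good)); // s/t/p/gispj 8
console.log(bad, lcsLength(query, bad));   // s/t/t/r/t/t/d/sr/ghj 5
```

With these assumptions the GulpImportStepPerformer.js path gets the longer common subsequence, which is the ordering expected in the report.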
thoughts?
That may work.
However, the computation of "is something an acronym" (camelCase, snake_case, etc.) is actually super expensive. The trick I have found is to not compute it unless a[i] == b[j], which is infrequent.
In order to compare the query with the acronym form via LCS, we would need to categorize every character of the subject. (There might be some efficient way to store and reuse those; we'll have to think / benchmark.)
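One possible way to "store and reuse" the categorization is sketched below: classify a subject position only the first time a query character matches it, and cache the result. This is an illustration under assumptions, not the library's code; `isWordStart` is a simplified hypothetical classifier and `countClassifications` exists only to compare how often the classifier would run.

```javascript
// Hypothetical sketch, not part of fuzzaldrin-plus.
function isWordStart(subject, i) {
  const curr = subject[i];
  if (!/[A-Za-z0-9]/.test(curr)) return false;
  if (i === 0) return true;
  const prev = subject[i - 1];
  if (!/[A-Za-z0-9]/.test(prev)) return true;
  return /[a-z]/.test(prev) && /[A-Z]/.test(curr);
}

// Compare an eager pass (classify every subject character up front)
// with a lazy, memoized pass (classify only positions that some query
// character actually matches, at most once each).
function countClassifications(subject, query) {
  const s = subject.toLowerCase();
  const q = query.toLowerCase();
  const cache = new Array(subject.length); // undefined = not classified yet
  let lazy = 0;
  for (let i = 0; i < q.length; i++) {
    for (let j = 0; j < s.length; j++) {
      if (q[i] === s[j] && cache[j] === undefined) {
        cache[j] = isWordStart(subject, j);
        lazy++;
      }
    }
  }
  return { eager: subject.length, lazy, cache };
}

const stats = countClassifications(
  "sft/Tests/Plugins/GulpImportStepPerformer.js",
  "sft/gisp.js"
);
console.log(stats.lazy < stats.eager); // true: unmatched characters are never classified
```

Whether the cache pays for itself would still need the benchmark mentioned above; the sketch only shows that the classifier runs at most once per matched position.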
I'm thinking along the following lines: on this line, make accro_score an array instead of a constant, so we would have something like the following after segmenting the query into words and then evaluating the acronym of each word.
sft/gisp.js
00004444000
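A sketch of how such a per-position array could be produced is below. It is only a guess at the idea, with several assumptions: the query is split into words on non-alphanumeric characters, each word is tested independently against the candidate's start-of-word letters, a word only scores if all of its letters are found in order, and the score is simply the word length. `acronymLetters`, `containsInOrder` and `acronymScoreArray` are hypothetical names, not fuzzaldrin-plus functions.

```javascript
// Hypothetical sketch, not part of fuzzaldrin-plus.
function isWordStart(subject, i) {
  const curr = subject[i];
  if (!/[A-Za-z0-9]/.test(curr)) return false;
  if (i === 0) return true;
  const prev = subject[i - 1];
  if (!/[A-Za-z0-9]/.test(prev)) return true;
  return /[a-z]/.test(prev) && /[A-Z]/.test(curr);
}

// Start-of-word letters of the subject, lowercased, separators dropped.
function acronymLetters(subject) {
  let out = "";
  for (let i = 0; i < subject.length; i++) {
    if (isWordStart(subject, i)) out += subject[i].toLowerCase();
  }
  return out;
}

function containsInOrder(needle, haystack) {
  let ni = 0;
  for (let hi = 0; hi < haystack.length && ni < needle.length; hi++) {
    if (needle[ni] === haystack[hi]) ni++;
  }
  return ni === needle.length;
}

// One score per query position: the word length if that query word is
// fully matched as an acronym of the subject, otherwise 0.
function acronymScoreArray(query, subject) {
  const acronym = acronymLetters(subject);
  const scores = new Array(query.length).fill(0);
  const wordPattern = /[A-Za-z0-9]+/g;
  let match;
  while ((match = wordPattern.exec(query)) !== null) {
    const word = match[0].toLowerCase();
    if (containsInOrder(word, acronym)) {
      scores.fill(word.length, match.index, match.index + word.length);
    }
  }
  return scores;
}

const scores = acronymScoreArray(
  "sft/gisp.js",
  "sft/Tests/Plugins/GulpImportStepPerformer.js"
);
console.log(scores.join("")); // 00004444000
```

For the GssHelper.js candidate the same function returns all zeros, since none of the query words is an acronym of that path.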
Agreed, that will be much cleaner and more performant. My only concern is that it makes the code less readable, where it is already a bit tough to follow and understand.
@jeancroy, any development for this issue? Thanks.
I'll put some time into making this work this weekend.
Seeing the other topic about edlo (editor localisation), maybe your LCS on acronym space is the most correct idea.