fuzz-aldrin-plus

Weak character matches given more priority than better acronym match

Open mdahamiwal opened this issue 8 years ago • 7 comments

Hi @jeancroy, here is another scenario where I think the acronym score is too weak.

candidates:
sft/Tests/Plugins/GulpImportStepPerformer.js
sft/Tests/Plugins/Cloud/GulpImportStepPerformer.js
sft/Tests/tfat/Reporting/Tools/teams/dev50/sftRobotic/GssHelper.js

search query: sft/gisp.js

results:
sft/Tests/tfat/Reporting/Tools/teams/dev50/sftRobotic/GssHelper.js
sft/Tests/Plugins/GulpImportStepPerformer.js
sft/Tests/Plugins/Cloud/GulpImportStepPerformer.js

expected:
sft/Tests/Plugins/GulpImportStepPerformer.js
sft/Tests/Plugins/Cloud/GulpImportStepPerformer.js
sft/Tests/tfat/Reporting/Tools/teams/dev50/sftRobotic/GssHelper.js

GssHelper.js doesn't even contain the letters of gisp in order, yet it still scores higher than the better acronym matches.

mdahamiwal, Oct 10 '16 08:10

Hi, if you look at the code, there's the concept of an "acronym prefix". Basically, it's a heuristic that tries to decide whether you are looking for an acronym or for consecutive letters.

And that heuristic requires you to put the acronym at the start of the query. Why the start? Because that early there's no backtracking. (The cleanest way out of this, I believe, is multi-word support: segment the query into words and restart the acronym prefix search on each word. However, without backtracking there's no guarantee those prefixes won't overlap.)

So in the query sft/gisp.js, the sft/ part is working against you.

Moreover, the path separator character (/ or \ depending on the environment) is special: sft/gisp.js is really interpreted as "sft should be found in the immediate parent folder of gisp.js", and that's another reason sftRobotic ranks so well. I agree it's not documented or universally understood that way, so maybe I'll disable the behavior or make it optional.
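To make the separator behavior concrete, here is a minimal sketch of the interpretation described above. It is not the library's code: the helper names (splitQuery, immediateParent, isSubsequence) are hypothetical, and only / is treated as a separator.

```javascript
// Hypothetical helpers for illustration; this is not fuzz-aldrin-plus code.

// Split a query at its last "/" into a folder hint and a file part.
function splitQuery(query) {
  const cut = query.lastIndexOf("/");
  return cut === -1
    ? { folder: "", file: query }
    : { folder: query.slice(0, cut), file: query.slice(cut + 1) };
}

// Name of the folder that directly contains the file.
function immediateParent(path) {
  const parts = path.split("/");
  return parts.length >= 2 ? parts[parts.length - 2] : "";
}

// Case-insensitive, in-order (not necessarily consecutive) match.
function isSubsequence(needle, haystack) {
  const n = needle.toLowerCase();
  const h = haystack.toLowerCase();
  let i = 0;
  for (let k = 0; k < h.length && i < n.length; k++) {
    if (h[k] === n[i]) i++;
  }
  return i === n.length;
}

const { folder } = splitQuery("sft/gisp.js"); // folder is "sft"
const candidates = [
  "sft/Tests/Plugins/GulpImportStepPerformer.js",
  "sft/Tests/tfat/Reporting/Tools/teams/dev50/sftRobotic/GssHelper.js",
];
for (const path of candidates) {
  const parent = immediateParent(path);
  console.log(parent, isSubsequence(folder, parent));
}
// prints "Plugins false" then "sftRobotic true"
```

Under this reading, only the GssHelper.js candidate has sft in its immediate parent folder, which is consistent with the ranking reported in the issue.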

For now, I'd suggest using something like sgisp.js or stpgisp.js.

Perhaps another way to bias the result toward acronyms is the uppercase-means-uppercase rule, where an uppercase query character must match uppercase while a lowercase one can match either case. It's not implemented, but other fuzzy libraries are successful with it.
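A minimal sketch of such a rule follows. As noted above, it is not implemented in the library, so the function names here are hypothetical and the in-order match is a plain greedy scan rather than a scored match.

```javascript
// Hypothetical "uppercase means uppercase" rule; not part of fuzz-aldrin-plus.

// An uppercase query character must match exactly;
// a lowercase (or non-letter) query character matches either case.
function charMatches(queryChar, subjectChar) {
  const queryIsUpper = queryChar !== queryChar.toLowerCase();
  return queryIsUpper
    ? queryChar === subjectChar
    : queryChar === subjectChar.toLowerCase();
}

// True when every query character is found in the subject, in order.
function matchesInOrder(query, subject) {
  let q = 0;
  for (let s = 0; s < subject.length && q < query.length; s++) {
    if (charMatches(query[q], subject[s])) q++;
  }
  return q === query.length;
}

const good = "sft/Tests/Plugins/GulpImportStepPerformer.js";
const bad = "sft/Tests/tfat/Reporting/Tools/teams/dev50/sftRobotic/GssHelper.js";

console.log(matchesInOrder("gisp", bad));  // true: lowercase letters match anywhere in the path
console.log(matchesInOrder("GISP", bad));  // false: there is no uppercase I after the G
console.log(matchesInOrder("GISP", good)); // true: Gulp, Import, Step, Performer
```

Typing the acronym in uppercase would therefore exclude the GssHelper.js candidate outright under this rule, at the cost of making the user signal intent through case.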

jeancroy, Oct 10 '16 11:10

Aha! I see it now. Thanks, @jeancroy, for explaining it in detail. I always thought a scoped acronym query such as sft/gisp.js was better than gisp.js, but it turns out to be the other way round. The main concern is that GssHelper.js scores highest even though it in no way relates to the query gisp, and it gets worse with highlighting, which feels like a bug. [image]

I understand that we don't want to backtrack so early, especially just to score acronyms, but how about an LCS between the candidate's StartOfWord characters and the query? Something of this sort:

sft/Tests/Plugins/GulpImportStepPerformer.js -> s/T/P/GISP
sft/Tests/tfat/Reporting/Tools/teams/dev50/sftRobotic/GssHelper.js -> s/T/t/R/T/t/d/s/GH

With those, if we try LCS against the query sft/gisp.js, the first candidate will score higher. Currently, the problem is that the scoreAcronym routine gives ZERO for all three candidates in this case: it exhausts the query before it even reaches the actual acronyms.
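The proposal can be sketched roughly as follows. This is an illustration under assumptions, not the library's scoreAcronym: acronymOf and lcsLength are hypothetical names, a word is assumed to start after one of / \ . _ - or a space, or at a lowercase-to-uppercase hump, and the resulting acronym strings differ slightly from the hand-written ones above (for example, the j of .js is kept).

```javascript
// Hypothetical sketch: LCS between a candidate's start-of-word characters
// and the query. Not fuzz-aldrin-plus code.
const SEPARATORS = "/\\._- ";
const isUpper = (c) => c !== c.toLowerCase();
const isLower = (c) => c !== c.toUpperCase();

// Keep "/" plus the first character of every word.
function acronymOf(path) {
  let out = "";
  for (let i = 0; i < path.length; i++) {
    const c = path[i];
    if (c === "/") { out += c; continue; }
    if (SEPARATORS.includes(c)) continue;
    const prev = i > 0 ? path[i - 1] : "/";
    const afterSeparator = SEPARATORS.includes(prev);
    const camelHump = isLower(prev) && isUpper(c);
    if (afterSeparator || camelHump) out += c;
  }
  return out;
}

// Case-insensitive longest common subsequence length (two-row DP).
function lcsLength(a, b) {
  const x = a.toLowerCase();
  const y = b.toLowerCase();
  let prev = new Array(y.length + 1).fill(0);
  for (let i = 1; i <= x.length; i++) {
    const cur = [0];
    for (let j = 1; j <= y.length; j++) {
      cur[j] = x[i - 1] === y[j - 1]
        ? prev[j - 1] + 1
        : Math.max(prev[j], cur[j - 1]);
    }
    prev = cur;
  }
  return prev[y.length];
}

const query = "sft/gisp.js";
const good = "sft/Tests/Plugins/GulpImportStepPerformer.js";
const bad = "sft/Tests/tfat/Reporting/Tools/teams/dev50/sftRobotic/GssHelper.js";

console.log(acronymOf(good), lcsLength(query, acronymOf(good))); // s/T/P/GISPj 8
console.log(acronymOf(bad), lcsLength(query, acronymOf(bad)));   // s/T/t/R/T/t/d/sR/GHj 5
```

With these assumptions the GulpImportStepPerformer.js candidate gets the longer common subsequence, which is the ordering the issue expects. How such a length would be folded into the overall score is left open here.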

Thoughts?

mdahamiwal, Oct 10 '16 15:10

That may work.

However, computing "is this character part of an acronym" (camelCase, snake_case, etc.) is actually quite expensive. The trick I have found is to not compute it unless a[i] == b[j], which is infrequent.
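The trick can be illustrated with a small counter. This is only a sketch: isWordStart stands in for the real (more expensive) classification, and countBoundaryChecks is a hypothetical helper that walks the full subject-by-query grid the way a dynamic-programming matcher would.

```javascript
// Hypothetical sketch of "classify only on a character match".
const SEPARATORS = "/\\._- ";

// Stand-in for the expensive test: is subject[pos] the start of a word?
function isWordStart(subject, pos) {
  if (pos === 0) return true;
  const prev = subject[pos - 1];
  const cur = subject[pos];
  const camelHump = prev !== prev.toUpperCase() && cur !== cur.toLowerCase();
  return SEPARATORS.includes(prev) || camelHump;
}

// Visit every (subject, query) cell, but classify only matching cells.
function countBoundaryChecks(subject, query) {
  const s = subject.toLowerCase();
  const q = query.toLowerCase();
  let checks = 0;
  let wordStartHits = 0;
  for (let i = 0; i < s.length; i++) {
    for (let j = 0; j < q.length; j++) {
      if (s[i] !== q[j]) continue; // cheap comparison first
      checks++;                    // expensive classification only here
      if (isWordStart(subject, i)) wordStartHits++;
    }
  }
  return { cells: s.length * q.length, checks, wordStartHits };
}

console.log(countBoundaryChecks("GssHelper.js", "gisp"));
// { cells: 48, checks: 5, wordStartHits: 1 }
```

For this pair only 5 of the 48 cells trigger the classification, which is the saving being described.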

In order to compare the query with the acronym form using LCS, we would need to categorize every character of the subject. (There might be an efficient way to store and reuse those categories; we'll have to think about it and benchmark.)

jeancroy, Oct 10 '16 16:10

I'm thinking along the following lines:

On this line, make accro_score an array instead of a constant.

Then, after segmenting the query into words and evaluating the acronym of each word, we would have something like this:

sft/gisp.js
00004444000
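One way to read that sketch is given below. The code is hypothetical throughout: it assumes query words are delimited by / \ . _ - or a space, that a word counts as an acronym when it is an in-order subsequence of the subject's start-of-word letters, and that each position of such a word receives the word's length as its score. The thread does not spell out what the 4 stands for, so that last point in particular is a guess.

```javascript
// Hypothetical per-position acronym score; not fuzz-aldrin-plus code.
const SEPARATORS = "/\\._- ";

// Lowercased start-of-word letters of the subject, separators dropped.
function acronymLetters(subject) {
  let out = "";
  for (let i = 0; i < subject.length; i++) {
    const c = subject[i];
    if (SEPARATORS.includes(c)) continue;
    const prev = i > 0 ? subject[i - 1] : "/";
    const camelHump = prev !== prev.toUpperCase() && c !== c.toLowerCase();
    if (SEPARATORS.includes(prev) || camelHump) out += c.toLowerCase();
  }
  return out;
}

// In-order match; both arguments are expected to be lowercase.
function isSubsequence(needle, haystack) {
  let i = 0;
  for (let k = 0; k < haystack.length && i < needle.length; k++) {
    if (haystack[k] === needle[i]) i++;
  }
  return i === needle.length;
}

// One score per query character; 0 where the word is not an acronym.
function acronymScores(query, subject) {
  const acronym = acronymLetters(subject);
  const scores = new Array(query.length).fill(0);
  let start = 0;
  for (let i = 0; i <= query.length; i++) {
    const atEnd = i === query.length || SEPARATORS.includes(query[i]);
    if (!atEnd) continue;
    const word = query.slice(start, i).toLowerCase();
    if (word.length > 0 && isSubsequence(word, acronym)) {
      scores.fill(word.length, start, i); // assumed score: word length
    }
    start = i + 1;
  }
  return scores;
}

const subject = "sft/Tests/Plugins/GulpImportStepPerformer.js";
console.log(acronymScores("sft/gisp.js", subject).join(""));
// 00004444000
```

Here sft and js are not acronyms of the subject and score 0, while gisp matches Gulp, Import, Step, Performer and scores 4 at each of its positions.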

jeancroy, Oct 10 '16 20:10

Agreed, that will be much cleaner and more performant. The only concern I see is that it makes the code even harder to read, when it is already a bit tough to follow and understand.

mdahamiwal, Oct 12 '16 06:10

@jeancroy, any development for this issue? Thanks.

mdahamiwal, Oct 20 '16 07:10

I'll put some time into making this work this weekend.

Seeing the other topic about edlo editor localisation, maybe your LCS on acronym space is the most correct idea.

jeancroy, Oct 20 '16 13:10