
Quality of matches: Proper casing

Open mrkishi opened this issue 9 years ago • 3 comments

Hello, folks.

I've come across the following situation:

const fuzz = require('fuzzaldrin-plus')
const data = ['Eat ten pizzas', 'Ten Pizzas']
fuzz.filter(data, 'tp')
// > ['Eat ten pizzas', 'Ten Pizzas']

Wouldn't Ten Pizzas make more sense, here? I haven't studied the algorithm too deeply, yet, but I think this is caused by the proper casing rules, as evidenced by this version of the same test:

const data = ['eat ten pizzas', 'ten pizzas']
fuzz.filter(data, 'tp')
// > ['ten pizzas', 'eat ten pizzas']

Now, while proper casing is indeed important when choosing matches, I feel like it's not a good indicator on lowercase queries, and it's currently being given too much weight.

A query that contains uppercase characters conveys a proper casing intention quite strongly. The opposite, however, is not true: a lowercase query doesn't mean you'd prefer lowercase matches.

Consider these hypothetical queries:

(['Proper Case', 'A proper case'], 'pc') => 'Proper Case'
(['proper case', 'a Proper Case'], 'PC') => 'A Proper Case'

fuzzaldrin-plus gets the second one, but misses the first. Am I off-base here in what I consider better matches?

mrkishi avatar Oct 17 '16 03:10 mrkishi

Hi mrkishi, thanks for the report.

while proper casing is indeed important when choosing matches, I feel like it's not a good indicator on lowercase queries,

I call what you describe smart-case: uppercase means uppercase, lowercase can mean anything. That convention is very popular in vim circles, among others.
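As a rough illustration of that convention, here is a minimal sketch. The helpers `smartCaseCharMatch` and `smartCaseSubsequence` are hypothetical names made up for this example and are not part of fuzzaldrin-plus; they only decide whether a query can match at all and do no scoring.

```javascript
// Hypothetical helper, not part of fuzzaldrin-plus.
// Smart-case rule: an uppercase query character must match exactly,
// while a lowercase query character matches either case.
function smartCaseCharMatch(queryChar, candidateChar) {
  const isUpper = queryChar !== queryChar.toLowerCase();
  if (isUpper) {
    return queryChar === candidateChar;
  }
  return queryChar === candidateChar.toLowerCase();
}

// Hypothetical helper: true when every query character appears, in order,
// somewhere in the candidate under the smart-case rule.
// This is a plain subsequence test with no ranking.
function smartCaseSubsequence(query, candidate) {
  let qi = 0;
  for (let ci = 0; ci < candidate.length && qi < query.length; ci++) {
    if (smartCaseCharMatch(query[qi], candidate[ci])) qi++;
  }
  return qi === query.length;
}

console.log(smartCaseSubsequence('pc', 'Proper Case'));   // lowercase query, mixed-case candidate
console.log(smartCaseSubsequence('PC', 'a proper case')); // uppercase query, lowercase candidate
```

Under this rule a lowercase query never excludes a candidate because of case, whereas an uppercase query does.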

There are several reasons I did not go the smart-case route. One of them is that I try to be agnostic of programming style: snake_case, CamelCase and kebab-case are weighted pretty much the same.

My main concern, though, is reachability. Imagine you have a local variable named something and also a method named Something or SomethingElse. There must exist a query that allows you to select the lowercase local variable.

Once reachability is there, I know that after a bit of a learning curve we have a useful tool. If I optimize against reachability, then some options will not be selectable no matter how experienced the user is.

and it's currently being given too much weight.

There was a LOT of pressure for proper casing, often for CamelCase, but there are also some use cases for proper casing as-is.

What to do from here?

It might be a coincidence, but both of your examples fall into what I call an acronym exact match (that is, the acronym of the subject is exactly the query):

  • 'pc' => 'Proper Case'
  • 'tp' => 'Ten Pizzas'

In theory that is also a strong bonus, but it grows with acronym length, so I may need to investigate what is going on here.


Another possibility is to have an option switch to behave in smartCase mode. It's not that hard to do; in the end, it would come down to testing whether maintaining both code paths makes things too slow.

jeancroy avatar Oct 17 '16 04:10 jeancroy

Thank you for the detailed (and prompt) response, @jeancroy!

The reachability argument is extremely convincing, and I didn't think of that. However, I'm not sure I completely understand its impact on these examples. It doesn't seem like smart-case goes against reachability. On the something vs Something example, an s query would favor something regardless of smart-casing support.

But even disregarding smart-case, I still come across some odd behaviors.

Let me preface this message with some (made-up, sorry) term definitions to minimize confusion (it's still pretty confusing..):

Literal [pattern]: a pattern of consecutive letters
Acronym [pattern]: a pattern of consecutive start-of-word letters

Match: any combination of sequential literal and/or acronym patterns

Literal [exact] match: a literal pattern that spans 100% of the query
Acronym [exact] match: an acronym pattern that spans 100% of the query
Exact match: a literal or acronym match

Full-length literal [exact] match: a literal match that also spans 100% of the candidate
Full-length acronym [exact] match: an acronym match that also spans 100% of the candidate
Full-length [exact] match: a full-length literal or full-length acronym match
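
To make these definitions concrete, here is a minimal sketch that labels a query/candidate pair with one of the exact-match kinds above. `wordInitials` and `classify` are hypothetical helpers written for this comment, not fuzzaldrin-plus code. The sketch assumes that words are separated by spaces, `_` or `-`, and it ignores case entirely, since the point is only to name the kind of match, not to score it.

```javascript
// Hypothetical helpers illustrating the term definitions above.
// None of this is fuzzaldrin-plus code.

// Assumption: words are separated by whitespace, '_' or '-'.
function wordInitials(candidate) {
  return candidate
    .split(/[\s_-]+/)
    .filter((word) => word.length > 0)
    .map((word) => word[0])
    .join('');
}

// Returns the most specific label that applies, ignoring case.
// 'other' covers mixed or scattered matches as well as non-matches.
function classify(query, candidate) {
  const q = query.toLowerCase();
  const c = candidate.toLowerCase();
  const initials = wordInitials(c);

  if (q.length === 0) return 'other';
  if (c === q) return 'full-length literal';        // query is the whole candidate
  if (initials === q) return 'full-length acronym'; // query is every word initial
  if (c.includes(q)) return 'literal';              // consecutive letters
  if (initials.includes(q)) return 'acronym';       // consecutive word initials
  return 'other';
}

console.log(classify('pc', 'PROPER CASE'));
console.log(classify('pc', 'a proper case'));
console.log(classify('propercase', 'PROPERCASE'));
console.log(classify('pr', 'a propercase'));
```

With these labels, 'pc' against 'PROPER CASE' is a full-length acronym match, while 'pc' against 'a proper case' is an acronym match that skips the leading word.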

For instance, proper casing is apparently not as influential on literal matches as it is on acronym matches:

(['A PROPER CASE', 'a proper case'], 'pc') => 'a proper case'
(['A PROPER CASE', 'a proper case'], 'PC') => 'A PROPER CASE'

(['A PROPERCASE', 'a propercase'], 'pr') => 'A PROPERCASE'
(['A PROPERCASE', 'a propercase'], 'PR') => 'A PROPERCASE'

// factor in length
(['A PR', 'a pr'], 'pr') => 'A PR'
(['A PR', 'a pr'], 'PR') => 'A PR'

A full-length literal match will "ignore" case errors, while an equivalent full-length acronym will not:

(['PROPER CASE', 'a proper case'], 'pc') => 'a proper case'
(['PROPERCASE', 'a propercase'], 'propercase') => 'PROPERCASE'

(['PR', 'a pr'], 'pr') => 'PR'

--

(['A PROPER CASE', 'proper case'], 'PC') => 'A PROPER CASE'
(['A PROPERCASE', 'propercase'], 'PROPERCASE') => 'propercase'

(['A PR', 'pr'], 'PR') => 'pr'

I have the feeling that acronyms would work better if these behaviors were aligned: either full-length acronyms should be more lenient towards case mismatches (like full-length literals), or literal matches should favor proper casing over being full-length.

Personally, I think giving full-length acronyms the same text-casing tolerance as full-length literal matches would be the more useful approach:

(['PROPER CASE', 'a proper case'], 'pc') => 'PROPER CASE'
(['PROPERCASE', 'a propercase'], 'propercase') => 'PROPERCASE'

Thoughts?

mrkishi avatar Oct 17 '16 13:10 mrkishi

Thank you for the report.

The typical definition of smart case I've seen is that casing doesn't matter for a lowercase query. I guess it can also be interpreted as casing mattering less, and you're right that this would allow better reach.

As for the other finding, you may have found a bug. There's an express lane for some cases, and I may not have kept those in sync properly.

What I can do is implement smart case, then lower the bonus for case-sensitive matches, and see if all the tests still pass with that.

Your idea of being more lenient with exact acronyms seems good. It's hard to land on that case by chance.


jeancroy avatar Oct 17 '16 22:10 jeancroy