fuzz-aldrin-plus icon indicating copy to clipboard operation
fuzz-aldrin-plus copied to clipboard

Quality of matches: Extended acronyms

Open mrkishi opened this issue 9 years ago • 9 comments

I came across another potential improvement.


Suppose we want to match editor localization from a long list of candidates:

const candidates = ['editor localization', 'edit', 'editor layout', 'expand line', 'enforce legion', ...]

We could start typing it from the start, and we'd find out there were conflicts up to editor lo. That's no good. We also realize it'd be foolish to try querying for el, as that's a very common acronym. Smart as we are, we decide to improve on the acronym and type:

fuzz.filter(candidates, 'edlo')

With a smug on our faces, we hit enter.

['deniedlol;p', 'edit localization', ...]

Scoring acronyms similarly to start-of-word matches is really great. However, it looks like it could be improved by also taking into account consecutive matching letters. In the previous example, edlo is a 4 consecutive middle-of-word letters match, but it is also an acronym with 4 start-of-word letter matches.

I have no idea how hard it'd be to make this modification (I still have to dive deeper into the algorithm), but wouldn't it be a desirable addition?

mrkishi avatar Oct 17 '16 04:10 mrkishi

Yeah mixing acronym and consecutive letters in query don't work well. There's an heuristic that try to determine if you want acronym or consecutive letters and you'll confuse it.

eloc probably would work best.

On Mon, Oct 17, 2016, 00:23 mrkishi [email protected] wrote:

I came across another potential improvement.

Suppose we want to match editor localization from a long list of candidates:

const candidates = ['editor localization', 'edit', 'editor layout', 'expand line', 'enforce legion', ...]

We could start typing it from the start, and we'd find out there were conflicts up to editor lo. That's no good. We also realize it'd be foolish to try querying for el, as that's a very common acronym. Smart as we are, we decide to improve on the acronym and type:

fuzz.filter(candidates, 'edlo')

With a smug on our faces, we hit enter.

['deniedlol;p', 'edit localization', ...]


Scoring acronyms similarly to start-of-word matches is really great. However, it looks like it could be improved by also taking into account consecutive matching letters. In the previous example, edlo is a 4 consecutive middle-of-word letters match, but it is also an acronym with 4 start-of-word letter matches.

I have no idea how hard it'd be to make this modification (I still have to dive deeper into the algorithm), but wouldn't it be a desirable addition?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/jeancroy/fuzzaldrin-plus/issues/28, or mute the thread https://github.com/notifications/unsubscribe-auth/AMLCEh08IhpN4gW0iGFfZPQEgMdgxtPzks5q0vhBgaJpZM4KYOGj .

jeancroy avatar Oct 17 '16 04:10 jeancroy

The problem is something like 'ed' vs 'edit database' the d is ambiguous so we compare two dumb strategy : consecutive letter and consecutive acronym to take a decision. Mixing consecutive and acronym won't score well for acronym although those may be word part.

On Mon, Oct 17, 2016, 00:31 Jean Christophe Roy [email protected] wrote:

Yeah mixing acronym and consecutive letters in query don't work well. There's an heuristic that try to determine if you want acronym or consecutive letters and you'll confuse it.

eloc probably would work best.

On Mon, Oct 17, 2016, 00:23 mrkishi [email protected] wrote:

I came across another potential improvement.

Suppose we want to match editor localization from a long list of candidates:

const candidates = ['editor localization', 'edit', 'editor layout', 'expand line', 'enforce legion', ...]

We could start typing it from the start, and we'd find out there were conflicts up to editor lo. That's no good. We also realize it'd be foolish to try querying for el, as that's a very common acronym. Smart as we are, we decide to improve on the acronym and type:

fuzz.filter(candidates, 'edlo')

With a smug on our faces, we hit enter.

['deniedlol;p', 'edit localization', ...]


Scoring acronyms similarly to start-of-word matches is really great. However, it looks like it could be improved by also taking into account consecutive matching letters. In the previous example, edlo is a 4 consecutive middle-of-word letters match, but it is also an acronym with 4 start-of-word letter matches.

I have no idea how hard it'd be to make this modification (I still have to dive deeper into the algorithm), but wouldn't it be a desirable addition?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/jeancroy/fuzzaldrin-plus/issues/28, or mute the thread https://github.com/notifications/unsubscribe-auth/AMLCEh08IhpN4gW0iGFfZPQEgMdgxtPzks5q0vhBgaJpZM4KYOGj .

jeancroy avatar Oct 17 '16 04:10 jeancroy

Edlo count as two segment of two letters, we are not able to teleport while counting consecutive.(we do count consecutive in real space and acronym space and take the vest of both)

On Mon, Oct 17, 2016, 00:42 Jean Christophe Roy [email protected] wrote:

The problem is something like 'ed' vs 'edit database' the d is ambiguous so we compare two dumb strategy : consecutive letter and consecutive acronym to take a decision. Mixing consecutive and acronym won't score well for acronym although those may be word part.

On Mon, Oct 17, 2016, 00:31 Jean Christophe Roy [email protected] wrote:

Yeah mixing acronym and consecutive letters in query don't work well. There's an heuristic that try to determine if you want acronym or consecutive letters and you'll confuse it.

eloc probably would work best.

On Mon, Oct 17, 2016, 00:23 mrkishi [email protected] wrote:

I came across another potential improvement.

Suppose we want to match editor localization from a long list of candidates:

const candidates = ['editor localization', 'edit', 'editor layout', 'expand line', 'enforce legion', ...]

We could start typing it from the start, and we'd find out there were conflicts up to editor lo. That's no good. We also realize it'd be foolish to try querying for el, as that's a very common acronym. Smart as we are, we decide to improve on the acronym and type:

fuzz.filter(candidates, 'edlo')

With a smug on our faces, we hit enter.

['deniedlol;p', 'edit localization', ...]


Scoring acronyms similarly to start-of-word matches is really great. However, it looks like it could be improved by also taking into account consecutive matching letters. In the previous example, edlo is a 4 consecutive middle-of-word letters match, but it is also an acronym with 4 start-of-word letter matches.

I have no idea how hard it'd be to make this modification (I still have to dive deeper into the algorithm), but wouldn't it be a desirable addition?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/jeancroy/fuzzaldrin-plus/issues/28, or mute the thread https://github.com/notifications/unsubscribe-auth/AMLCEh08IhpN4gW0iGFfZPQEgMdgxtPzks5q0vhBgaJpZM4KYOGj .

jeancroy avatar Oct 17 '16 04:10 jeancroy

Could acronym space be extended to take more than 1 start-of-word letters into account? Without drastic changes to the algorithm, that is.

Something that would represent the following rules:

  • ed could be counted as a segment of two letters: 1*2
  • el, a segment of two acronyms (1-letter): 2*1
  • edl, a segment of two acronyms ([2, 1]-letter): 2+1
  • edlo, a segment of two acronyms ([2, 2]-letter): 2+2

mrkishi avatar Oct 17 '16 14:10 mrkishi

First, before making large change to algorithm I try to evaluate how far are we from a solution. In this particular case, typing one or two letters edlo => edloca would probably suffice.

Then it's a bit hard to answer if something is possible. Mostly because I'd hate to answer no. But currently acronym and consecutive are handled by very separate part of code.

There's a few area where I know the code that handle acronym can be improved. However because start of word count as much as acronym it does not matter if e belong to el or to ed it will be scored the same.

I rely a lot on sequence length to sort useful from garbage. (And there's multiple example of garbage accidentally hitting acronym letters). A sequence of two is one letter away from an accidental match, hence my first recommandation to try to aim for a sequence of at least 3.

jeancroy avatar Oct 17 '16 14:10 jeancroy

I'll add thank you for the time you put into these report. Even if I don't see an obvious solution now I keep those in mind for later.

jeancroy avatar Oct 17 '16 15:10 jeancroy

Makes sense! For now, I'll train to use longer consecutive matches instead of acronyms.

I'm the one who should be thanking you for such an awesome library. I wish I could help you with more than simple use-case discussions, but I'm afraid it'll take me quite a while to grasp the algorithms in play here (I'm trying to, nevertheless!).

Also, I just saw your FuzzySearch library. It seems to share (maybe more than) a few similarities with fuzzaldrin-plus, while being "fuzzier" in its heuristics usage. Is that a fair statement? Do you see one of them ever deprecating the other?

Thank you, again :100:

mrkishi avatar Oct 17 '16 18:10 mrkishi

Hi

I'll train to use longer consecutive matches instead of acronyms.

One way that works well is typing all the acronym first letters, then if you need more refinement, you add parts of last letters.

Also, I just saw your FuzzySearch library.

FuzzySearch does word-against-word search, including out of order words.

Use case: I want to .... ( paint my room ) would match against ( room painting )

fuzzaldin-plus does bag-of-letters against programming symbol or path search.

In the end it was natural language vs programming language. Also FuzzySearch allow more complex structure. IE search in title, author, description.

I'll try to answer your other analysis later today.

jeancroy avatar Oct 17 '16 19:10 jeancroy

@jeancroy Wow, I came here wondering if there was a way to get fuzzaldrin-plus to match out-of-order queries (i.e. "bar foo" to match "foo bar") and FuzzySearch is exactly what I'm looking for. My project is using more natural language than file paths. Thanks for the time you spend on these projects, and for making them so fast using radical algorithms =)

aguynamedben avatar Jul 25 '17 06:07 aguynamedben