orama icon indicating copy to clipboard operation
orama copied to clipboard

Typo tolerance misses results

Open lucaong opened this issue 3 years ago • 8 comments
trafficstars

Describe the bug

The bug occurs in lyra-0.0.1-beta-13.

Typo tolerant searches miss expected results.

To Reproduce

const db = new Lyra({
  schema: {
    txt: "string",
  },
  stemming: false // Disabling stemming to avoid confounding the results
})

// Insert "stelle", and other words that are all within edit distance 2
await db.insert({ txt: 'stelle' })
await db.insert({ txt: 'stele' })
await db.insert({ txt: 'snelle' })
await db.insert({ txt: 'stellle' })
await db.insert({ txt: 'scelte' })

await db.search({ term: 'stelle', tolerance: 2 })
/* returns:
{
  elapsed: '1ms',
  hits: [
    { id: 'cuD10UKGVzBmEQz30dQKI', txt: 'stelle' },
    { id: 'DqXro5_9z2wiPQlTmK91n', txt: 'stellle' },
    undefined,
    undefined,
    undefined,
    undefined,
    undefined,
    undefined,
    undefined,
    undefined
  ],
  count: 2
}
*/

Expected behavior

All 5 documents should be returned, since they all contain a term within edit distance 2 from the query.

lucaong avatar Jul 21 '22 10:07 lucaong

Looks like a bug, thank you for opening this!

micheleriva avatar Jul 21 '22 11:07 micheleriva

Hey, I am looking for projects to contribute to and I found this issue in https://goodfirstissue.dev/. I looked at the code a bit and I feel like either I dont fully understand the feature or its implementation is a bit off. For example, the documentation (here) shows that searching for 'Cris' returns 1 result. But, with the same tolerance, searching for 'Bhris' should also return that same result, no? 'Bhris' doesnt work because the 'B' starts the find function down the 'Burton' subtree. Should 'Bhris' also work in this case?

As for the original steps to reproduce, I think the issue is the fact that the original tree search is incompletable with the tolerance feature(?). Basically, the current implementation uses the current subtree and checks for matches that can be made within that subtree, but that ignores all the other valid options. I believe i improved that but I feel like i still do not have a good grasp on the requirements here. Do check out: https://github.com/nearform/lyra/pull/47

Thanks

tlahav avatar Jul 25 '22 02:07 tlahav

Hey @tlahav , you're correct, thank you for opening a PR 🙏

micheleriva avatar Jul 25 '22 07:07 micheleriva

@tlahav 's PR definitely is a big step forward, but there are still issues remaining (see my comments on the PR). I'd recommend adding a larger base of tests, otherwise it is hard to make spot fixes.

To be completely honest, even after making it correct, the current approach is very inefficient, as it is running Levenshtein distance calculation on the whole trie for each term. There are dynamic programming techniques to make it vastly more efficient.

One option would be to build Lyra on top of a proven and optimized full-text search core implementation, such as MiniSearch (full disclosure: I am the author). On top of an extensive unit tests suite, MiniSearch has generative property-based tests asserting the correctness of the algorithms and data structures. It has no dependency, and is a mere 7kb in size. It has extensive documentation and is in wide use on many projects. On top of the user-centric API, it also offers a programmatic API for other libraries that seek to use MiniSearch as a building block for their own query language, composition, etc. (see the docs for Advanced Query Combinations)

MiniSearch is also not the only full-text search implementation for JavaScript: Lunr comes to mind as another great one. This blog post includes a comparison between MiniSearch and some of the other existing solutions.

Again, I don't want to sound dismissive here, but I wanted to put the option on the table. No need to explain if you prefer not, it's your call @micheleriva , I do empathize with the pleasure of reinventing the wheel 😉 .

lucaong avatar Jul 25 '22 13:07 lucaong

Coming back to this after a bit. I think there needs to be an executive decision made here on which way to go. If we were to implement the search feature with MiniSearch I dont see why not replace the entire search with it. On the other hand, if we do want to reinvent the wheel.. We would then need more requirements. Regardless, I fear I do not have the bandwidth to work on this further.

tlahav avatar Oct 17 '22 13:10 tlahav

Hi @tlahav, We have plans and bandwidth to work on Lyra and fix what needs to be fixed, so we'll proceed that way. @lucaong already knows how much I appreciate MiniSearch and other search libraries, and it's not a matter of "reinventing the wheel," as plans with Lyra are beyond full-text search exclusively.

Lyra is currently backed by NearForm, and we have a roadmap for it that I cannot make public yet, but we definitely have big plans to support a large number of features that requires deep knowledge of every bit we're moving with the search algorithm.

If you find the current issue to be a blocker for your applications, MiniSearch, Fuse.js, Lunr, etc. are amazing solutions! Maybe go with MiniSearch which is actively maintained (Lunr seems abandoned) 🙂

One problem with the current approach is that we're applying a Levenshtein algorithm against strings, while we should do that directly on the trees. @jkomyno did a fantastic job improving this approach with https://github.com/LyraSearch/lyra/pull/131, but we're definitely going to refactor how we measure the Levenshtein distance moving from strings to trie nodes.

We have this in our backlog, so keep an eye on the repo to get notified when we'll release this 🙂

micheleriva avatar Oct 17 '22 13:10 micheleriva

Thanks for the explanation. I suggest removing the 'good first issue' and 'help wanted' labels off of this ticket so it wont show up in sites that aggregate these tickets.

I am a fan of the library though, do tell me if there's a different item to look at.

tlahav avatar Oct 19 '22 14:10 tlahav

Consider that MiniSearch also exposes its general purpose radix tree implementation as a module called SearchableMap that can be imported separately from the main module. It is a data structure with the same exact interface as a JavaScript Map, but implemented as a radix tree (a trie compressed by joining chains of nodes with no siblings). This makes it possible to expose two additional methods, for searching keys by fuzzy and prefix search. That's what powers the inverted index in MiniSearch, and it does already implement Levenshtein distance computation on the radix tree nodes with an efficient dynamic programming algorithm.

The data structure is well tested with near total coverage, also including generative (a.k.a. property based) tests.

lucaong avatar Oct 19 '22 18:10 lucaong

Closing this, we moved many things and have full-time people working on this.

micheleriva avatar Jun 30 '23 13:06 micheleriva