
Best practice for N-gram and set Lucene param with Clouseau

Open natcohen opened this issue 4 years ago • 5 comments

CouchDB/Clouseau indexing allows analyzers, but what about n-gram tokenization? What is the best practice for n-grams? Should we use an algorithm to generate n-grams within the index JavaScript function? Or can we take advantage of Lucene's n-gram support?

Also how can we set Lucene parameters such as allowing leading wildcard (https://lucene.apache.org/core/4_0_0/queryparser/org/apache/lucene/queryparser/classic/QueryParserBase.html#setAllowLeadingWildcard(boolean))?

natcohen avatar Mar 04 '20 14:03 natcohen

We don't expose the NGram analyzers in Clouseau today but we'd consider merging a pull request if you want to add it.

We don't support setting of that parameter either, and I don't think we'd accept a patch to allow it given it has such bad performance implications.

rnewson avatar Mar 12 '20 08:03 rnewson

@rnewson I'd love to contribute and add the n-gram analyzer. Unfortunately I don't know Erlang, and working on Clouseau is a bit overwhelming since the project seems quite complex with very little documentation... I'm also not an expert in Java, so that doesn't help either!

Regarding the leading wildcard parameter, it was just an example! I don't plan to use it but wanted to know if there was a way to use all the parameters Lucene offers.

natcohen avatar Mar 12 '20 14:03 natcohen

@rnewson Partial search is widely used, especially for auto-complete. Any chance someone can help expose the n-gram analyzer? I have posted an issue to get some guidance here but Clouseau doesn't seem super active!

PS There are other useful analyzers that would be great to expose, such as edge n-gram...

natcohen avatar Mar 19 '20 19:03 natcohen

hi @natcohen sorry for silence.

Appreciate the desire to help but things move forward in this project when folks contribute. It's useful to highlight a desire for this feature, though. If someone works on it to a reviewable standard, I'm sure someone will have time to help it over the last few steps.

rnewson avatar Mar 31 '20 09:03 rnewson

@natcohen I see your efforts here and appreciate them a lot. We are also eagerly looking out for an n-gram analyzer in Clouseau - but it seems to be very low priority. If you are after auto-complete, we've had good experience with prefix searches using a trailing wildcard (*), which works out of the box.

Prefix search (aka words starting with X): Works out of the box. Index the field normally and query with a trailing wildcard (value*).

Suffix search (aka words ending with X): The other way around (tokens ending with a given string) is trickier, but we are working on a prototype that might just allow it:

  • Index string as you normally would (index("field", "value")) to allow prefix search
  • Index the string again in a different field but reversed (index("r_field", "eulav")) to allow suffix search

When you search, query both fields: add the input twice, once as-is on the normal field and once reversed on the reversed field: field:value* r_field:eulav*

which will search for tokens starting with "value" or ending with "value".
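A minimal sketch of that two-field idea, not the actual prototype. The names reverseString, indexDoc and buildQuery, and the doc.name field, are made up for illustration. Inside a real search index function, index() is supplied by CouchDB; here it is passed in as a parameter so the logic can be run on its own. The query helper does no escaping of Lucene special characters, and how the reversed text is tokenized depends on the analyzer you configure.

```javascript
// Hypothetical helper: reverse a string.
// Note: split("") works on UTF-16 code units, so characters outside the
// Basic Multilingual Plane (e.g. emoji) would be broken by this.
function reverseString(s) {
  return s.split("").reverse().join("");
}

// Sketch of the body of a search index function.
// `index` stands in for the index() function CouchDB provides;
// `doc.name` is an assumed field name.
function indexDoc(doc, index) {
  if (typeof doc.name !== "string") {
    return;
  }
  index("field", doc.name);                   // allows prefix search
  index("r_field", reverseString(doc.name));  // allows suffix search
}

// Hypothetical client-side helper: build the combined query string.
// No escaping of Lucene special characters is attempted here.
function buildQuery(input) {
  return "field:" + input + "* r_field:" + reverseString(input) + "*";
}
```

For the input "value", buildQuery returns the same query string as in the example above.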

What is left is infix search (words containing "value"), for which I think n-gram analyzers are the only way to go.
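Until an n-gram analyzer is exposed, one possible workaround is the one raised in the original question: generate the n-grams in the index JavaScript function itself. The sketch below is only an illustration of that idea, not something Clouseau provides. The names ngrams and indexNgrams, the field names, and the gram lengths 3 to 5 are all assumptions, and index() is again passed in as a parameter so the code runs standalone. It assumes the configured analyzer splits the indexed text on whitespace, so each gram becomes its own term. Expect a noticeably larger index.

```javascript
// Hypothetical helper: all substrings of `token` with length minLen..maxLen.
function ngrams(token, minLen, maxLen) {
  var grams = [];
  for (var n = minLen; n <= maxLen; n++) {
    for (var i = 0; i + n <= token.length; i++) {
      grams.push(token.slice(i, i + n));
    }
  }
  return grams;
}

// Sketch of the body of a search index function.
// `index` stands in for the index() function CouchDB provides;
// `doc.name` and "field_ngram" are assumed names.
function indexNgrams(doc, index) {
  if (typeof doc.name !== "string") {
    return;
  }
  var tokens = doc.name.toLowerCase().split(/\s+/);
  var seen = {};
  var unique = [];
  for (var t = 0; t < tokens.length; t++) {
    var grams = ngrams(tokens[t], 3, 5);
    for (var g = 0; g < grams.length; g++) {
      // hasOwnProperty avoids false hits on inherited keys like "constructor"
      if (!Object.prototype.hasOwnProperty.call(seen, grams[g])) {
        seen[grams[g]] = true;
        unique.push(grams[g]);
      }
    }
  }
  if (unique.length > 0) {
    // Space-separated so a whitespace-splitting analyzer yields one term per gram
    index("field_ngram", unique.join(" "));
  }
}
```

An infix lookup would then be a plain term query on that field (for example field_ngram:alu), with the search input lowercased and kept within the same gram lengths.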

streunerlein avatar Sep 01 '23 11:09 streunerlein