Best practices for n-grams and setting Lucene parameters with Clouseau
CouchDB/Clouseau indexing allows analyzers, but what about n-gram tokenization? What is the best practice for n-grams? Should we generate n-grams ourselves with an algorithm inside the index JavaScript function, or can we take advantage of Lucene's built-in n-gram support?
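For illustration, here is a minimal sketch of the "do it in the index function" option: a CouchDB search index function that emits character 3-grams by hand. The field names (`name`, `name_ngram`) and the gram size are assumptions for the example, not anything Clouseau prescribes:

```javascript
// Sketch: hand-rolled character 3-grams inside a CouchDB search index
// function. Field names and the gram size (3) are illustrative.
function (doc) {
  if (doc.name && typeof doc.name === "string") {
    var text = doc.name.toLowerCase();
    var grams = [];
    for (var i = 0; i + 3 <= text.length; i++) {
      grams.push(text.substring(i, i + 3));
    }
    // Emit the grams as one whitespace-separated string; the default
    // analyzer splits them back into individual tokens at index time.
    index("name_ngram", grams.join(" "));
  }
}
```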
Also, how can we set Lucene parameters such as allowing leading wildcards (https://lucene.apache.org/core/4_0_0/queryparser/org/apache/lucene/queryparser/classic/QueryParserBase.html#setAllowLeadingWildcard(boolean))?
We don't expose the NGram analyzers in Clouseau today, but we'd consider merging a pull request if you want to add them.
We don't support setting that parameter either, and I don't think we'd accept a patch to allow it, given that it has such bad performance implications.
@rnewson I'd love to contribute and add the n-gram analyzer. Unfortunately I don't know Erlang, and working on Clouseau is a bit overwhelming since the project seems quite complex with very little documentation... I'm also not an expert in Java, so that doesn't help either!
Regarding the leading-wildcard parameter, it was just an example! I don't plan to use it, but I wanted to know whether there is a way to use all the parameters Lucene offers.
@rnewson Partial search is widely used, especially for auto-complete. Any chance someone can help with exposing the n-gram analyzer? I have posted an issue to get some guidance here, but Clouseau doesn't seem super active!
PS: There are other useful analyzers that would be great to expose, such as the edge n-gram analyzer...
Hi @natcohen, sorry for the silence.
I appreciate the desire to help, but things move forward in this project when folks contribute. It's useful to highlight a desire for this feature, though. If someone works on it to a reviewable standard, I'm sure someone will have time to help it through the last few steps.
@natcohen I see your efforts here and appreciate them a lot. We are also eagerly looking out for an n-gram analyzer in Clouseau, but it seems to be very low priority. If you're looking for auto-complete, we've had good experience with prefix searches using a trailing wildcard (*), which works out of the box.
Prefix search (aka words starting with X)
Out of the box: index the field normally and query with a trailing wildcard (`value*`).
Suffix search (aka words ending with X)
The other way around, suffix search (tokens ending with certain things), is trickier, but we are working on a prototype that might just allow it:
- Index the string as you normally would (`index("field", "value")`) to allow prefix search
- Index the string again in a different field, but reversed (`index("r_field", "eulav")`), to allow suffix search
When you perform a search, query both fields by adding the input twice, once reversed against the reversed field: `field:value* r_field:eulav*`, which will match tokens starting with "value" or ending with "value".
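Putting both halves together, a sketch of what the index function for this prototype could look like (the field names `field`/`r_field` are the illustrative ones from above, and the document property is an assumption):

```javascript
// Sketch of the reversed-field trick described above: each value is
// indexed twice, once as-is for prefix search and once reversed for
// suffix search. Field and property names are illustrative.
function (doc) {
  if (doc.value && typeof doc.value === "string") {
    // Supports prefix queries like:  field:value*
    index("field", doc.value);
    // Supports suffix queries like:  r_field:eulav*
    index("r_field", doc.value.split("").reverse().join(""));
  }
}
```

Note that the client has to reverse the user's input itself before building the `r_field:...*` half of the query.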
What is left is infix search, i.e. words containing "value", for which I think n-gram analyzers are the only way to go.
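Until an n-gram analyzer is exposed, the hand-rolled 3-gram field from the earlier sketch can approximate infix search if the client also turns the user's input into 3-grams. A hypothetical client-side helper (field name `name_ngram` as in the earlier sketch):

```javascript
// Hypothetical client-side helper: turn an infix query into a
// conjunction of 3-grams against the name_ngram field, e.g.
// "valu" -> 'name_ngram:"val" AND name_ngram:"alu"'
function infixQuery(input) {
  var text = input.toLowerCase();
  var grams = [];
  for (var i = 0; i + 3 <= text.length; i++) {
    grams.push('name_ngram:"' + text.substring(i, i + 3) + '"');
  }
  return grams.join(" AND ");
}
```

One caveat with this approach: requiring all grams to match can still produce false positives (documents containing every gram but not the contiguous substring), so results may need post-filtering on the client.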