Mango+Nouveau queries and Indexes don’t behave as expected
Hello,
I’ve been trying out Mango and Nouveau together, orienting myself along what’s in the docs and the tests, and have run into several problems. I’ve provided a full set of cURL commands to reproduce the issues, namely:
- Queries against
stringindex fields only occasionally work $textqueries work the same as regular queries againststringindex fields: they’re keyword searches, not Lucene full-text searches"type": "text"index fields never seem to work at all
This didn’t become obvious in testing because the tests seem to:
- Use the one of the two
stringindex fields that actually works by sheer coincidence - Use a
$textquery that also worked by coincidence, not because the underlying mechanism works properly
I’ve basically done something similar to the Mango-related tests in this file and linked the relevant lines from the test file below.
Using CouchDB 3.4.2 with nouveau-1.0-SNAPSHOT.jar. For all of these, _explain showed that the correct index was used, unless I’ve stated otherwise.
Setup
Add this line to all requests in case you need auth:
--user COUCHDB_USERNAME:COUCHDB_PASSWORD \
First, make a fresh DB:
curl -X PUT \
'http://127.0.0.1:5984/mouveau' \
--header 'Accept: */*' \
Insert some docs:
curl -X POST \
'http://127.0.0.1:5984/mouveau/_bulk_docs' \
--header 'Accept: */*' \
--header 'Content-Type: application/json' \
--data-raw '
{
"docs": [
{
"_id": "FishStew",
"servings": 4,
"subtitle": "Delicious with freshly baked bread",
"title": "Fish Stew",
"type": "fish"
},
{
"_id": "LambStew",
"servings": 6,
"subtitle": "Serve with a whole meal scone topping",
"title": "Lamb Stew",
"type": "meat"
},
{
"_id": "Dumplings",
"servings": 8,
"subtitle": "Hand-made dumplings make a great accompaniment",
"title": "Dumplings",
"type": "meat"
}
]
}'
Now make a Nouveau index. This is basically the same index as used in the tests but adapted to our documents' field names:
curl -X POST \
'http://127.0.0.1:5984/mouveau/_index' \
--header 'Accept: */*' \
--header 'Content-Type: application/json' \
--data-raw '{
"type": "nouveau",
"index": {
"fields": [
{"name": "title", "type": "string"},
{"name": "servings", "type": "number"},
{"name": "type", "type": "string"}
],
"default_analyzer": "keyword"
}
}'
Let’s do some queries. First, querying for strings, analogous to the Mango search by string test:
curl -X POST \
'http://127.0.0.1:5984/mouveau/_find' \
--header 'Accept: */*' \
--header 'Content-Type: application/json' \
--data-raw '{
"selector": {
"type": "meat"
}
}'
That works.
(Only showing selector lines for brevity from now on, the cURL command is always the same)
"selector": {
"title": "Lamb Stew"
}
This doesn’t work at all:
{
"error": "nouveau_search_error",
"reason": "bad_request: field \"title_3astring\" was indexed without position data; cannot run PhraseQuery (phrase=title_3astring:\"lamb stew\")",
"ref": 2009526893
}
Hm, maybe if we just search for a single word?
"selector": {
"title": "Dumplings"
}
Different nope, but also nope:
{
"docs": [],
"bookmark": "W10="
}
What about numbers? This is analogous to the Mango search by number test:
"selector": {
"servings": {"$gte": 6}
}
That works.
The last test is a $text query, same as the mango search by text test. Now, interestingly, the doc example for this query type selector looks different than the one in the test:
// test: just a keyword
{"$text": "hello"}
// docs: this is Lucene full-text search syntax
{
"_id": { "$gt": null },
"$text": "director:George"
}
Also, remember: the test runs this $text query against an index that does not have any "type":"text" fields! And surprisigly, this works!
"selector": {
"$text": "lamb"
}
But this seems to be only a keyword search, and not a fully-fledged text search, which becomes apparent because these all fail:
"selector": {
"$text": "lamb s"
}
"selector": {
"$text": "title:lamb"
}
"selector": {
"$text": "title:lam*"
}
"selector": {
"$text": "lam*"
}
And all permutations thereof.
Summary:
numberqueries, as in the tests, work. Yay.stringqueries, as in the tests, sometimes work. Couldn’t find a pattern yet.$textqueries against the index as used in the tests do not actually perform Lucene queries, but the same keyword-style searches as thestringqueries, but against all (string?) index fields.
Workaround attempts:
- Change the order of the
stringqueries to see what happens: no change, querying fortitlestill fails,typestill works. - Try different values for
default_analyzerandanalyzer.default, bothstandardandenglishdo not change the results for$textqueries.
What about the text index type?
Now, the test mango index and the example index in the docs are the same, and interestingly, the docs refer to this as a Text index. However, this section also describes a text index field type, just like in Nouveau, to go alongside string and number and boolean index field types. This isn’t documented or tested anywhere, but I’d assume this to work like so (delete the old ddoc first):
curl -X POST \
'http://127.0.0.1:5984/mouveau/_index' \
--header 'Accept: */*' \
--header 'Content-Type: application/json' \
--data-raw '{
"type": "nouveau",
"index": {
"fields": [
{"name": "title", "type": "text"}
],
"default_analyzer": "english"
}
}'
But no query against this returns anything, no permutation of $text works, and using something like "title": "Dumplings" falls back to not using any index.
Expected results:
- Queries against
stringindex fields always work. $textqueries use Lucene syntax? Unsure what the intention is here, the docs do one thing, the tests another.texttype indexes are queryable with a selector like"$text": "fieldname:querystring".
First point is that string fields are not for searching, and the docs say this, though perhaps it needs to be more prominent.
I can explain the odd results in detail;
- "title": "Lamb Stew"
This fails because the query is converted to a phrase query, and a string field is stored without position data (since there's only a single token). The error is definitely ugly, though.
- "title": "Dumplings"
This fails because the value is indexed as Dumplings but the query gets case-folded to dumplings and therefore doesn't match.
- "$text": "lamb"
This works because adds all string/text fields into the default field. Which is why "$text": "title:lamb" doesn't match, because field title was not case-folded at index time and Lamb != lamb.
Not sure where to go here. At minimum the docs need enhancing to make it clear not to query on string fields, but perhaps those should be errors instead.
figuring out the final bit on text fields. that seems like a genuine bug (clouseau only had string). what happens is that no fields are parsed out of the doc and thus it doesn't get indexed at all.
(great bug report btw, much appreciated)
Right, had to re-familiarise myself with the obfuscating mango/_find layer and I think all of this is correct behaviour (🤮 , right?)
with _find you can only perform text queries on the default field (which is the concatenation of all your string fields). when you use the $text operator it constructs the query as $default:whateveryoutyped. So if you searched for title:foo you'd query $default:title:foo which is either a syntax error or at the very least won't match docs with 'foo' in the title field (and it's not even called 'title' in the index that mango creates).
So...
- don't search on the
stringfields, they can be sorted on though (they can also be used as facets but _find has never been able to reach that feature even with clouseau) - when using
$textremember it is searching across allstringfields (indexed with the default analyzer, so don't set that tokeywordif you want to do term, phrase, wildcard queries.
The real interface to nouveau is the _nouveau endpoint. None of these issues affect that. The index is built with the field names and types that you specify with the index function and the ?q= query parameter supports fielded search queries and everything else.
tldr;
the mango / _ find is a deliberately constrained interface to help users that know the mongo syntax and it has many limitations.
from https://cloud.ibm.com/docs/Cloudant?topic=Cloudant-ibm-cloudant-query-parameters which says
The $text operator is based on a Lucene search with a standard analyzer. The operator isn't case-sensitive, and matches on any words. However, the $text operator doesn't support full Lucene syntax, such as wildcards, fuzzy matches, or proximity detection. For more information, see the Search documentation. The $text operator applies to all strings found in the document. If you place this operator in the context of a field name, it's invalid.
wish I'd remembered that earlier...
how about we introduce a new operator, say $fulltext that can be used with more Lucene fidelity?