couchdb icon indicating copy to clipboard operation
couchdb copied to clipboard

Mango+Nouveau queries and Indexes don’t behave as expected

Open espy opened this issue 9 months ago • 6 comments

Hello,

I’ve been trying out Mango and Nouveau together, orienting myself along what’s in the docs and the tests, and have run into several problems. I’ve provided a full set of cURL commands to reproduce the issues, namely:

  • Queries against string index fields only occasionally work
  • $text queries work the same as regular queries against string index fields: they’re keyword searches, not Lucene full-text searches
  • "type": "text" index fields never seem to work at all

This didn’t become obvious in testing because the tests seem to:

  • Use the one of the two string index fields that actually works by sheer coincidence
  • Use a $text query that also worked by coincidence, not because the underlying mechanism works properly

I’ve basically done something similar to the Mango-related tests in this file and linked the relevant lines from the test file below.

Using CouchDB 3.4.2 with nouveau-1.0-SNAPSHOT.jar. For all of these, _explain showed that the correct index was used, unless I’ve stated otherwise.

Setup

Add this line to all requests in case you need auth:

--user COUCHDB_USERNAME:COUCHDB_PASSWORD \

First, make a fresh DB:

curl  -X PUT \
  'http://127.0.0.1:5984/mouveau' \
  --header 'Accept: */*' \

Insert some docs:

curl  -X POST \
  'http://127.0.0.1:5984/mouveau/_bulk_docs' \
  --header 'Accept: */*' \
  --header 'Content-Type: application/json' \
  --data-raw '
{
    "docs": [
        {
            "_id": "FishStew",
            "servings": 4,
            "subtitle": "Delicious with freshly baked bread",
            "title": "Fish Stew",
            "type": "fish"
        },
        {
            "_id": "LambStew",
            "servings": 6,
            "subtitle": "Serve with a whole meal scone topping",
            "title": "Lamb Stew",
            "type": "meat"
        },
        {
            "_id": "Dumplings",
            "servings": 8,
            "subtitle": "Hand-made dumplings make a great accompaniment",
            "title": "Dumplings",
            "type": "meat"
        }
    ]
}'

Now make a Nouveau index. This is basically the same index as used in the tests but adapted to our documents' field names:

curl  -X POST \
  'http://127.0.0.1:5984/mouveau/_index' \
  --header 'Accept: */*' \
  --header 'Content-Type: application/json' \
  --data-raw '{
    "type": "nouveau",
    "index": {
        "fields": [
            {"name": "title", "type": "string"},
            {"name": "servings", "type": "number"},
            {"name": "type", "type": "string"}
        ],
        "default_analyzer": "keyword"
    }
}'

Let’s do some queries. First, querying for strings, analogous to the Mango search by string test:

curl  -X POST \
  'http://127.0.0.1:5984/mouveau/_find' \
  --header 'Accept: */*' \
  --header 'Content-Type: application/json' \
  --data-raw '{
    "selector": {
      "type": "meat"
    }
}'

That works.

(Only showing selector lines for brevity from now on, the cURL command is always the same)

"selector": {
  "title": "Lamb Stew"
}

This doesn’t work at all:

{
  "error": "nouveau_search_error",
  "reason": "bad_request: field \"title_3astring\" was indexed without position data; cannot run PhraseQuery (phrase=title_3astring:\"lamb stew\")",
  "ref": 2009526893
}

Hm, maybe if we just search for a single word?

"selector": {
  "title": "Dumplings"
}

Different nope, but also nope:

{
  "docs": [],
  "bookmark": "W10="
}

What about numbers? This is analogous to the Mango search by number test:

"selector": {
  "servings": {"$gte": 6}
}

That works.

The last test is a $text query, same as the mango search by text test. Now, interestingly, the doc example for this query type selector looks different than the one in the test:

// test: just a keyword
{"$text": "hello"}

// docs: this is Lucene full-text search syntax
{
  "_id": { "$gt": null },
  "$text": "director:George"
}

Also, remember: the test runs this $text query against an index that does not have any "type":"text" fields! And surprisigly, this works!

"selector": {
  "$text": "lamb"
}

But this seems to be only a keyword search, and not a fully-fledged text search, which becomes apparent because these all fail:

"selector": {
  "$text": "lamb s"
}
"selector": {
  "$text": "title:lamb"
}
"selector": {
  "$text": "title:lam*"
}
"selector": {
  "$text": "lam*"
}

And all permutations thereof.

Summary:

  • number queries, as in the tests, work. Yay.
  • string queries, as in the tests, sometimes work. Couldn’t find a pattern yet.
  • $text queries against the index as used in the tests do not actually perform Lucene queries, but the same keyword-style searches as the string queries, but against all (string?) index fields.

Workaround attempts:

  • Change the order of the string queries to see what happens: no change, querying for title still fails, type still works.
  • Try different values for default_analyzer and analyzer.default, both standard and english do not change the results for $text queries.

What about the text index type?

Now, the test mango index and the example index in the docs are the same, and interestingly, the docs refer to this as a Text index. However, this section also describes a text index field type, just like in Nouveau, to go alongside string and number and boolean index field types. This isn’t documented or tested anywhere, but I’d assume this to work like so (delete the old ddoc first):

curl  -X POST \
  'http://127.0.0.1:5984/mouveau/_index' \
  --header 'Accept: */*' \
  --header 'Content-Type: application/json' \
  --data-raw '{
    "type": "nouveau",
    "index": {
        "fields": [
            {"name": "title", "type": "text"}
        ],
        "default_analyzer": "english"
    }
}'

But no query against this returns anything, no permutation of $text works, and using something like "title": "Dumplings" falls back to not using any index.

Expected results:

  • Queries against string index fields always work.
  • $text queries use Lucene syntax? Unsure what the intention is here, the docs do one thing, the tests another.
  • text type indexes are queryable with a selector like "$text": "fieldname:querystring".

espy avatar Mar 18 '25 14:03 espy

First point is that string fields are not for searching, and the docs say this, though perhaps it needs to be more prominent.

I can explain the odd results in detail;

  1. "title": "Lamb Stew"

This fails because the query is converted to a phrase query, and a string field is stored without position data (since there's only a single token). The error is definitely ugly, though.

  1. "title": "Dumplings"

This fails because the value is indexed as Dumplings but the query gets case-folded to dumplings and therefore doesn't match.

  1. "$text": "lamb"

This works because adds all string/text fields into the default field. Which is why "$text": "title:lamb" doesn't match, because field title was not case-folded at index time and Lamb != lamb.

rnewson avatar Mar 18 '25 20:03 rnewson

Not sure where to go here. At minimum the docs need enhancing to make it clear not to query on string fields, but perhaps those should be errors instead.

rnewson avatar Mar 18 '25 20:03 rnewson

figuring out the final bit on text fields. that seems like a genuine bug (clouseau only had string). what happens is that no fields are parsed out of the doc and thus it doesn't get indexed at all.

(great bug report btw, much appreciated)

rnewson avatar Mar 18 '25 20:03 rnewson

Right, had to re-familiarise myself with the obfuscating mango/_find layer and I think all of this is correct behaviour (🤮 , right?)

with _find you can only perform text queries on the default field (which is the concatenation of all your string fields). when you use the $text operator it constructs the query as $default:whateveryoutyped. So if you searched for title:foo you'd query $default:title:foo which is either a syntax error or at the very least won't match docs with 'foo' in the title field (and it's not even called 'title' in the index that mango creates).

So...

  1. don't search on the string fields, they can be sorted on though (they can also be used as facets but _find has never been able to reach that feature even with clouseau)
  2. when using $text remember it is searching across all string fields (indexed with the default analyzer, so don't set that to keyword if you want to do term, phrase, wildcard queries.

The real interface to nouveau is the _nouveau endpoint. None of these issues affect that. The index is built with the field names and types that you specify with the index function and the ?q= query parameter supports fielded search queries and everything else.

tldr;

the mango / _ find is a deliberately constrained interface to help users that know the mongo syntax and it has many limitations.

rnewson avatar Mar 19 '25 16:03 rnewson

from https://cloud.ibm.com/docs/Cloudant?topic=Cloudant-ibm-cloudant-query-parameters which says

The $text operator is based on a Lucene search with a standard analyzer. The operator isn't case-sensitive, and matches on any words. However, the $text operator doesn't support full Lucene syntax, such as wildcards, fuzzy matches, or proximity detection. For more information, see the Search documentation. The $text operator applies to all strings found in the document. If you place this operator in the context of a field name, it's invalid.

wish I'd remembered that earlier...

rnewson avatar Mar 19 '25 17:03 rnewson

how about we introduce a new operator, say $fulltext that can be used with more Lucene fidelity?

janl avatar Mar 28 '25 07:03 janl