Unable to query fields indexed by nouveau

Open GMishx opened this issue 11 months ago • 26 comments

Description

I have compiled the latest CouchDB with ./configure --enable-nouveau and it is running fine. I also started the nouveau server with the generated ./rel/couchdb/nouveau/bin/nouveau server.

Now, when I try to query the indexes, it does not work for any field other than email.

Steps to Reproduce

I have a sw360users database with documents like the following:

{
  "_id": "2b1086cef0a14b7eaeb6f0204b58b8cf",
  "_rev": "1-bb5f36a88a804eeba165b704090476b7",
  "type": "user",
  "email": "[email protected]",
  "userGroup": "CLEARING_ADMIN",
  "fullname": "Test Clearing1",
  "givenname": "Test",
  "lastname": "Clearing1"
}

On this DB, I created a ddoc for nouveau with the following document:

{
  "_id": "_design/nouveau_user",
  "nouveau": {
    "users": {
      "index": "function(doc) {\n  if (doc.type == 'user' ) {\n    if (typeof(doc.givenname) == 'string') {\n        index(\"string\", \"givenname\", doc.givenname, {\"store\": true});\n    }\n    if (typeof(doc.email) == 'string') {\n        index(\"string\", \"email\", doc.email, {\"store\": true});\n    }\n    if (typeof(doc.lastname) == 'string') {\n        index(\"string\", \"lastname\", doc.lastname, {\"store\": true});\n    }\n  }\n}",
      "default_analyzer": "english",
      "field_analyzers": {
        "email": "email"
      }
    }
  }
}

Here, I am indexing 3 fields: givenname, lastname and email. I tried various configurations, changing the order of the index() calls in the function and using different types of analyzers for the index.

I see no error in the nouveau logs or in the couchdb logs after the creation of ddoc. Thus, I relaxed :-)

Note: Responses are trimmed for brevity.

Now, when I query all records with q=*:*, I get 10 hits since I have 10 users:

$ curl --user "admin:admin" 'http://localhost:5984/sw360users/_design/nouveau_user/_nouveau/users' -X POST -H 'Content-Type: application/json' -d '{"q": "*:*"}'

{"total_hits_relation":"EQUAL_TO","total_hits":10,"ranges":null,"hits":[{"order":[{"value":1.0,"@type":"float"},{"value":"11a7c29def2c4304a97db812521bd82c","@type":"string"}],"id":"11a7c29def2c4304a97db812521bd82c","fields":{"lastname":"Administrator","givenname":"Setup","email":"[email protected]"}},{"order":[{"value":1.0,"@type":"float"},{"value":"2a7cedcf38e24ebbade7a23f3f07f793","@type":"string"}],"id":"2a7cedcf38e24ebbade7a23f3f07f793","fields":{"lastname":"Clearing2","givenname":"Test","email":"[email protected]"}}],"counts":null,"bookmark":"W1t7InZhbHVlIjoxLjAsIkB0eXBlIjoiZmxvYXQifSx7InZhbHVlIjoiZWIzM2U2ZGI1YTE1NDAxNjgxMDg4OWQ4ZTU0NWZmODIiLCJAdHlwZSI6InN0cmluZyJ9XSxbeyJ2YWx1ZSI6MS4wLCJAdHlwZSI6ImZsb2F0In0seyJ2YWx1ZSI6ImVmMjMxYjQ5NTk3ZDRiZDViMmI4OThkNjcxODIwY2U3IiwiQHR5cGUiOiJzdHJpbmcifV1d"}

If I query on the email field, I get the expected response:

$ curl --user "admin:admin" 'http://localhost:5984/sw360users/_design/nouveau_user/_nouveau/users' -X POST -H 'Content-Type: application/json' -d '{"q": "email:setup*"}'

{"total_hits_relation":"EQUAL_TO","total_hits":1,"ranges":null,"hits":[{"order":[{"value":1.0,"@type":"float"},{"value":"11a7c29def2c4304a97db812521bd82c","@type":"string"}],"id":"11a7c29def2c4304a97db812521bd82c","fields":{"lastname":"Administrator","givenname":"Setup","email":"[email protected]"}}],"counts":null,"bookmark":"W1t7InZhbHVlIjoxLjAsIkB0eXBlIjoiZmxvYXQifSx7InZhbHVlIjoiMTFhN2MyOWRlZjJjNDMwNGE5N2RiODEyNTIxYmQ4MmMiLCJAdHlwZSI6InN0cmluZyJ9XV0="}

But with the lastname field, I get nothing:

$ curl --user "admin:admin" 'http://localhost:5984/sw360users/_design/nouveau_user/_nouveau/users' -X POST -H 'Content-Type: application/json' -d '{"q": "lastname:Administrator"}'

{"total_hits_relation":"EQUAL_TO","total_hits":0,"ranges":null,"hits":[],"counts":null,"bookmark":"W10="}

I tried multiple times with lastname:admin*, lastname:administrator and lastname:Administrator, but got no hits, even with different analyzers. The behavior is the same for the other field, givenname. Querying only works for email, with various Lucene query syntaxes.

Expected Behaviour

Expected to query the indexes on different fields as well.

Your Environment

$ curl --user "admin:admin" 'http://localhost:5984'

{"couchdb":"Welcome","version":"3.3.3-29db2df","git_sha":"29db2df","uuid":"8722f4f42d4f2d566be241e6035df095","features":["nouveau","access-ready","partitioned","pluggable-storage-engines","reshard","scheduler"],"vendor":{"name":"The Apache Software Foundation"}}

Nouveau is also configured in default.ini using default ./rel/couchdb/etc/nouveau.yaml:

[nouveau]
enable = true
url = http://127.0.0.1:5987
  • CouchDB version used: 3.3.3-29db2df
  • Browser name and version: curl 8.2.1 (x86_64-conda-linux-gnu) libcurl/8.2.1 OpenSSL/3.0.10 zlib/1.2.13 libssh2/1.10.0 nghttp2/1.52.0
  • Operating system and version: Ubuntu 22.04.4 LTS

Additional Context

Using counts to aggregate index values works just as expected.

$ curl --user "admin:admin" 'http://localhost:5984/sw360users/_design/nouveau_user/_nouveau/users' -X POST -H 'Content-Type: application/json' -d '{"q": "*:*", "counts": ["lastname"]}'

{"total_hits_relation":"EQUAL_TO","total_hits":10,"ranges":null,"hits":[{"order":[{"value":1.0,"@type":"float"},{"value":"11a7c29def2c4304a97db812521bd82c","@type":"string"}],"id":"11a7c29def2c4304a97db812521bd82c","fields":{"lastname":"Administrator","givenname":"Setup","email":"[email protected]"}},{"order":[{"value":1.0,"@type":"float"},{"value":"2a7cedcf38e24ebbade7a23f3f07f793","@type":"string"}],"id":"2a7cedcf38e24ebbade7a23f3f07f793","fields":{"lastname":"Clearing2","givenname":"Test","email":"[email protected]"}}],"counts":{"lastname":{"User2":1,"User1":1,"User":1,"Clearing2":1,"Clearing1":1,"Clearing":1,"Administrator":1,"Admin2":1,"Admin1":1,"Admin":1}},"bookmark":"W1t7InZhbHVlIjoxLjAsIkB0eXBlIjoiZmxvYXQifSx7InZhbHVlIjoiZWIzM2U2ZGI1YTE1NDAxNjgxMDg4OWQ4ZTU0NWZmODIiLCJAdHlwZSI6InN0cmluZyJ9XSxbeyJ2YWx1ZSI6MS4wLCJAdHlwZSI6ImZsb2F0In0seyJ2YWx1ZSI6ImVmMjMxYjQ5NTk3ZDRiZDViMmI4OThkNjcxODIwY2U3IiwiQHR5cGUiOiJzdHJpbmcifV1d"}

$ curl --user "admin:admin" 'http://localhost:5984/sw360users/_design/nouveau_user/_nouveau_info/users'

{"name":"_design/nouveau_user/users","search_index":{"update_seq":628,"purge_seq":0,"num_docs":10,"disk_size":8521}}

GMishx avatar Feb 29 '24 05:02 GMishx

thank you for the detailed report, I will look into it.

My first thought is that the query parser is transforming your "Administrator" to "administrator", but as it was indexed as "string" and not "text" it is held as "Administrator" in the index itself, and thus doesn't match.

assuming that's it then I agree that the query parser should not do this for string fields and I will make a fix.

rnewson avatar Feb 29 '24 11:02 rnewson

But I tried to query with "admin", "administrator" and "Administrator" without any luck. The same holds for other values like "User". So I suspect that the analyzed value in the index and the analyzed query do not produce the same result. Email is a special case: the analyzer does not change the value, so it matches.

GMishx avatar Feb 29 '24 13:02 GMishx

Yes, I mean in the index it is "Administrator" but you are not able to query with the "A" because the query parser converts it with the standard analyzer. You might say 'q=foo:Administrator' but the query parser is making it a term query on "administrator".

rnewson avatar Feb 29 '24 16:02 rnewson

e.g., if you specified the "keyword" analyzer for the lastname field, the query parser won't lowercase it for you and it should then match.

rnewson avatar Feb 29 '24 16:02 rnewson

https://lucene.apache.org/core/9_10_0/core/org/apache/lucene/document/StringField.html vs https://lucene.apache.org/core/9_10_0/core/org/apache/lucene/document/TextField.html btw

rnewson avatar Feb 29 '24 16:02 rnewson
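
The mismatch rnewson describes can be pictured with a toy model. This is only a sketch under stated assumptions: `indexStringField`, `queryAnalyzers` and `matches` are hypothetical helpers written for illustration, not nouveau or Lucene APIs, and real analyzers do more than change case.

```javascript
// Toy model of the mismatch; none of these functions are nouveau or Lucene APIs.

// A "string" field is held in the index as a single, unmodified term.
function indexStringField(value) {
  return new Set([value]);
}

// Hypothetical stand-ins for what happens to the query term before lookup.
const queryAnalyzers = {
  // Lowercases the query term, as a typical text analyzer does.
  lowercasing: (term) => term.toLowerCase(),
  // Leaves the query term untouched, like the "keyword" analyzer.
  keyword: (term) => term,
};

// A term query matches only if the analyzed query term equals an indexed term.
function matches(indexedTerms, queryTerm, analyzerName) {
  return indexedTerms.has(queryAnalyzers[analyzerName](queryTerm));
}

const lastnameTerms = indexStringField("Administrator");

console.log(matches(lastnameTerms, "Administrator", "lowercasing")); // false
console.log(matches(lastnameTerms, "Administrator", "keyword"));     // true
```

In this model the first lookup fails because the query side becomes "administrator" while the index still holds "Administrator", which is the behaviour reported in the issue.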

Yes, I mean in the index it is "Administrator" but you are not able to query with the "A" because the query parser converts it with the standard analyzer. You might say 'q=foo:Administrator' but the query parser is making it a term query on "administrator".

Can confirm that's the case. I modified the doc to have "lastname": "administrator", and then the query q=lastname:admin* gave back the expected result.

GMishx avatar Mar 01 '24 05:03 GMishx

https://lucene.apache.org/core/9_10_0/core/org/apache/lucene/document/StringField.html vs https://lucene.apache.org/core/9_10_0/core/org/apache/lucene/document/TextField.html btw

I also modified the index to use type text rather than string: index("text", "lastname", doc.lastname, {"store": true}); and the query started working even for uppercase "Admin1".

If that's expected behavior, I can update the index creation doc.

GMishx avatar Mar 01 '24 05:03 GMishx

"text" type means the value is analyzed. the Lucene analyzers typically force to lower case among other effects, which explains the new success.

rnewson avatar Mar 01 '24 11:03 rnewson
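
For reference, here is a sketch of the thread's index function with the two name fields switched to "text", as GMishx describes. The local `index` function is a stand-in that only records calls so the snippet can run outside CouchDB (inside CouchDB the real `index` is provided to index functions), and the sample document uses a placeholder email address.

```javascript
// Local stand-in for the index() function that CouchDB provides to index
// functions; it only records calls so this sketch can run on its own.
var calls = [];
function index(type, name, value, options) {
  calls.push({ type: type, name: name, value: value, options: options });
}

// The index function from the issue, with givenname and lastname as "text"
// (analyzed) and email left as "string".
function userIndex(doc) {
  if (doc.type == 'user') {
    if (typeof(doc.givenname) == 'string') {
      index("text", "givenname", doc.givenname, {"store": true});
    }
    if (typeof(doc.email) == 'string') {
      index("string", "email", doc.email, {"store": true});
    }
    if (typeof(doc.lastname) == 'string') {
      index("text", "lastname", doc.lastname, {"store": true});
    }
  }
}

// Sample document modelled on the one in the issue; the email is a placeholder.
userIndex({
  type: "user",
  email: "user@example.com",
  givenname: "Test",
  lastname: "Clearing1"
});

console.log(calls.map(function (c) { return c.type + ":" + c.name; }).join(", "));
// text:givenname, string:email, text:lastname
```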

Yeah, I got to understand that from the (archived) couchdb-lucene project

@rnewson, would you still like to work on the lower-casing of string queries (I can also try to look into the issue)? Or should I close the issue, since I was using the wrong field type?

GMishx avatar Mar 01 '24 11:03 GMishx

I still intend to make an enhancement. Assuming I'm right in my first comment you did nothing wrong, and I would like nouveau to do the right thing.

We know that "string" fields will not be analyzed, we need to tell the query parser to also not analyze the query string for "string" fields (and nouveau knows the index definition, so it does know which fields are "string" or "text", etc).

rnewson avatar Mar 01 '24 12:03 rnewson

rehi (I've been out on vacation the last few weeks).

I've mocked up a few approaches to this locally and I don't like any of them, they all either have a non-trivial overhead or other odd side-effects more surprising than what you've encountered.

I think the right move is to clarify that if you index with type "string" and you intend to search on that field (as opposed to only sorting on it, for example), then you need to specify the "keyword" analyzer for that field in the index definition. If you do that, everything works out nicely.

In your case I think you actually do want the "text" type for "lastname", so that you can search case-insensitively, but only you know for sure.

rnewson avatar Mar 26 '24 09:03 rnewson
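
A design document following rnewson's suggestion could be put together as below. This is a minimal sketch: the `_id` and the choice of `lastname` as the exact-match field are assumptions for illustration, while the `nouveau`, `index`, `default_analyzer` and `field_analyzers` keys follow the design document shown earlier in the thread.

```javascript
// Source of the index function, kept as a string because that is how it is
// stored in the design document. It indexes lastname as an unanalyzed "string".
var indexSource = [
  "function(doc) {",
  "  if (doc.type == 'user' && typeof(doc.lastname) == 'string') {",
  "    index('string', 'lastname', doc.lastname, {'store': true});",
  "  }",
  "}"
].join("\n");

// Design document pairing the "string" field with the "keyword" analyzer so
// the query term is not lowercased. The _id is a hypothetical name.
var designDoc = {
  _id: "_design/nouveau_user_exact",
  nouveau: {
    users: {
      index: indexSource,
      default_analyzer: "english",
      field_analyzers: {
        lastname: "keyword"
      }
    }
  }
};

console.log(JSON.stringify(designDoc, null, 2));
```

With this definition a query such as lastname:Administrator should match the stored value exactly, while lastname:administrator should not; for case-insensitive search the "text" type remains the better fit.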

Hey, thanks for the updates. The documentation makes the type field more clear in #5018.

GMishx avatar Mar 26 '24 09:03 GMishx

no problem!

rnewson avatar Mar 26 '24 09:03 rnewson

@rnewson Even the suggestions from #5018 fail for a different case.

I have a document with fields name and version. Version is stored as a string in couchdb and sample values are "1", "2", "4.2.0". With the default analyzer (simple_asciifolding), these values were getting lost:

$ curl --user "admin:admin" 'http://localhost:5984/_nouveau_analyze' -X POST -H 'Content-Type: application/json' -d '{"analyzer": "simple_asciifolding", "text": "4.2.0"}' | jq
{
  "tokens": []
}

Thus I created the index as follows:

{
  "_id": "_design/lucene",
  "_rev": "238-6e02d3801cc64311f5244cb242855e82",
  "nouveau": {
    "projects": {
      "default_analyzer": "keyword",
      "field_analyzers": {
        "version": "keyword"
      },
      "index": "function(doc) {\nif(doc.version !== undefined && doc.version != null && doc.version.length >0) {\n      index('text', 'version', doc.version, {'store': true});\n    }\n}"
    }
  }
}

Notice that I set the field_analyzers entry for the version field to keyword, as #5018 suggested. I also tried both string and text as the field type. But in all cases, the queries "version:4.2.0" and "version:1" do not find the document.

GMishx avatar Mar 28 '24 10:03 GMishx
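
The empty token list GMishx got for "4.2.0" is what a letter-based tokenizer would produce. The sketch below assumes that simple_asciifolding splits text at non-letter characters and lowercases the result, as Lucene's SimpleAnalyzer does; `letterTokens` and `keywordTokens` are hypothetical helpers and ignore the ASCII folding step.

```javascript
// Rough imitation of a letter-based tokenizer: keep runs of letters and
// lowercase them. Assumption: this approximates simple_asciifolding closely
// enough to show why digits and dots vanish; it is not the real analyzer.
function letterTokens(text) {
  return (text.match(/\p{L}+/gu) || []).map((token) => token.toLowerCase());
}

// The keyword analyzer emits the whole input as a single token.
function keywordTokens(text) {
  return [text];
}

console.log(letterTokens("4.2.0"));          // no tokens: the input has no letters
console.log(keywordTokens("4.2.0"));         // one token: "4.2.0"
console.log(letterTokens("Test Clearing1")); // "test" and "clearing"
```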

You specify the keyword analyzer for all fields. For input "4.2.0" that tokenizes to "4.2.0".

Can you show the result of querying the view with ?q=version:4.2.0, ?q=version:"4.2.0" and finally a ?q=_id: with the doc id of the doc with a version of 4.2.0 (e.g, ?q=_id:doc1).

rnewson avatar Mar 28 '24 10:03 rnewson

Here are the outputs as requested. I am getting same results for GET and POST queries.

Output of analyze:

$ curl --user "admin:admin" --silent 'http://localhost:5984/_nouveau_analyze' -X POST -H 'Content-Type: application/json' -d '{"analyzer": "keyword", "text": "4.2.0"}' | jq
{
  "tokens": [
    "4.2.0"
  ]
}

Output of version:4.2.0:

$ curl --user "admin:admin" --silent 'http://localhost:5984/sw360db/_design/lucene/_nouveau/projects' -X POST -H 'Content-Type: application/json' -d '{"q": "version:4.2.0"}' | jq
{
  "total_hits_relation": "EQUAL_TO",
  "total_hits": 0,
  "ranges": null,
  "hits": [],
  "counts": null,
  "bookmark": "W10="
}

Output of version:"4.2.0":

$ curl --user "admin:admin" --silent 'http://localhost:5984/sw360db/_design/lucene/_nouveau/projects' -X POST -H 'Content-Type: application/json' -d '{"q": "version:\"4.2.0\""}' | jq
{
  "total_hits_relation": "EQUAL_TO",
  "total_hits": 0,
  "ranges": null,
  "hits": [],
  "counts": null,
  "bookmark": "W10="
}

Output with the doc match:

$ curl --user "admin:admin" --silent 'http://localhost:5984/sw360db/_design/lucene/_nouveau/projects' -X POST -H 'Content-Type: application/json' -d '{"q": "_id:c26ccf0179c14f22bf6f9d0d55acd2dc"}' | jq
{
  "total_hits_relation": "EQUAL_TO",
  "total_hits": 1,
  "ranges": null,
  "hits": [
    {
      "order": [
        {
          "value": 0.44583148,
          "@type": "float"
        },
        {
          "value": "c26ccf0179c14f22bf6f9d0d55acd2dc",
          "@type": "string"
        }
      ],
      "id": "c26ccf0179c14f22bf6f9d0d55acd2dc",
      "fields": {
        "version": "4.2.0",
        "state": "ACTIVE",
        "projectType": "CUSTOMER",
        "name": "fossology",
        "clearingState": "IN_PROGRESS",
        "businessUnit": "DEPARTMENT"
      }
    }
  ],
  "counts": null,
  "bookmark": "W1t7InZhbHVlIjowLjQ0NTgzMTQ4LCJAdHlwZSI6ImZsb2F0In0seyJ2YWx1ZSI6ImMyNmNjZjAxNzljMTRmMjJiZjZmOWQwZDU1YWNkMmRjIiwiQHR5cGUiOiJzdHJpbmcifV1d"
}

GMishx avatar Mar 28 '24 10:03 GMishx

thanks.

rnewson avatar Mar 28 '24 11:03 rnewson

ok, the short answer is that the (nouveau-specific) query parser interprets "4.2.0" as a number and performs a numeric query, not a text/string query. I'm surprised by that, but obviously the same would be true for "4", etc.

This is a very helpful thread btw, these are exactly the issues I want to confront before removing the 'experimental' label from nouveau.

rnewson avatar Mar 28 '24 11:03 rnewson

jshell> var nf = NumberFormat.getInstance(Locale.getDefault());
nf ==> java.text.DecimalFormat@674dc

jshell> nf.parse("4");
$5 ==> 4

jshell> nf.parse("4.2");
$6 ==> 4.2

jshell> nf.parse("4.2.0");
$7 ==> 4.2

That's core Java behaviour.

rnewson avatar Mar 28 '24 11:03 rnewson
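
JavaScript's `parseFloat` has the same prefix-parsing behaviour as the `NumberFormat.parse` calls above, so it makes a convenient analogy. This is only an illustration of lenient versus strict number detection: `isNumericLiteral` is a hypothetical helper, and nothing here is the code nouveau actually runs.

```javascript
// Lenient parsing: reads the longest numeric prefix and ignores the rest.
console.log(parseFloat("4"));     // 4
console.log(parseFloat("4.2"));   // 4.2
console.log(parseFloat("4.2.0")); // 4.2

// Strict check: the whole string must be a single decimal number.
function isNumericLiteral(text) {
  return /^-?\d+(\.\d+)?$/.test(text);
}

console.log(isNumericLiteral("4"));     // true
console.log(isNumericLiteral("4.2"));   // true
console.log(isNumericLiteral("4.2.0")); // false
```

A lenient parser accepts "4.2.0" as the number 4.2, which is why a version string can end up treated as a numeric query.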

BTW for context, I am translating the project sw360 which currently uses couchdb-lucene to nouveau.

GMishx avatar Mar 28 '24 11:03 GMishx

that's helpful to know, thanks. I'm looking at changing the "magical" nature of numeric queries. I extended/altered the basic lucene query syntax to auto-detect numbers but it has always been a bit awkward (as you've re-discovered).

so I'm looking at a syntax extension that lets you tell nouveau that you intend to look for "2" as a string or as a number, explicitly.

rnewson avatar Mar 28 '24 13:03 rnewson

posted a draft PR that addresses this, with some extensive prose on whether it's a good idea or not.

rnewson avatar Mar 29 '24 17:03 rnewson

Will it make sense to use the field type of the index? We already have types double, string and text to differentiate the types of values. (just guessing)

GMishx avatar Apr 01 '24 04:04 GMishx

@GMishx I merged a fix for this, but note that I had to change how some things work (you can see the documentation diff in https://github.com/apache/couchdb/pull/5021). Essentially you don't need to put a type indicator at the end of the field name when sorting.

what should now happen is you can index a field as a number or a string and the right kind of query will be used. Please give it a try.

rnewson avatar Apr 10 '24 16:04 rnewson
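
One way to picture "the right kind of query will be used" is a lookup on the type each field was indexed with, along the lines GMishx suggested. This is a hypothetical sketch, not nouveau's implementation: `fieldTypes`, `planQuery` and the `size` field are invented for illustration.

```javascript
// Hypothetical map from field name to the type it was indexed with.
const fieldTypes = { version: "string", name: "text", size: "double" };

// Decide how a single field:value clause would be interpreted.
function planQuery(field, value) {
  const type = fieldTypes[field];
  if (type === "double") {
    const n = Number(value);
    if (Number.isNaN(n)) {
      throw new Error(`${field} expects a number, got "${value}"`);
    }
    return { kind: "numeric", field, value: n };
  }
  if (type === "string" || type === "text") {
    // The value stays a string, even if it looks like a number.
    return { kind: "term", field, value };
  }
  throw new Error(`unknown field: ${field}`);
}

console.log(planQuery("version", "4.2.0")); // term query on the literal "4.2.0"
console.log(planQuery("version", "1"));     // term query on the literal "1"
console.log(planQuery("size", "4"));        // numeric query on 4
```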

I can confirm that indexing now works as expected for the issue mentioned. Thanks for the quick fix, @rnewson. I can index and query the values "4.2.0", "1" and "2".

I will test it further with other values as well and update here.

GMishx avatar Apr 15 '24 07:04 GMishx

thanks for the confirmation, I like this change and your issue was the nudge I needed to make this improvement, so thank you.

rnewson avatar Apr 15 '24 09:04 rnewson