Unable to query fields indexed by nouveau
Description
I have compiled the latest CouchDB with ./configure --enable-nouveau and it is running fine. I also started the nouveau server with the generated ./rel/couchdb/nouveau/bin/nouveau server.
Now, when I try to query the indexes, it does not work for any field other than email.
Steps to Reproduce
I have a sw360users database with the following fields:
{
"_id": "2b1086cef0a14b7eaeb6f0204b58b8cf",
"_rev": "1-bb5f36a88a804eeba165b704090476b7",
"type": "user",
"email": "[email protected]",
"userGroup": "CLEARING_ADMIN",
"fullname": "Test Clearing1",
"givenname": "Test",
"lastname": "Clearing1"
}
On this DB, I created a ddoc for nouveau with the following document:
{
"_id": "_design/nouveau_user",
"nouveau": {
"users": {
"index": "function(doc) {\n if (doc.type == 'user' ) {\n if (typeof(doc.givenname) == 'string') {\n index(\"string\", \"givenname\", doc.givenname, {\"store\": true});\n }\n if (typeof(doc.email) == 'string') {\n index(\"string\", \"email\", doc.email, {\"store\": true});\n }\n if (typeof(doc.lastname) == 'string') {\n index(\"string\", \"lastname\", doc.lastname, {\"store\": true});\n }\n }\n}",
"default_analyzer": "english",
"field_analyzers": {
"email": "email"
}
}
}
}
Here, I am indexing 3 fields: givenname, lastname and email. I tried various configurations, changing the position of the index() calls in the function and using different types of analyzers when creating the index.
I see no error in the nouveau logs or in the CouchDB logs after creating the ddoc. Thus, I relaxed :-)
Note: Responses are trimmed for brevity.
Now, when I query all records with q=*:*, I get 10 hits since I have 10 users:
$ curl --user "admin:admin" 'http://localhost:5984/sw360users/_design/nouveau_user/_nouveau/users' -X POST -H 'Content-Type: application/json' -d '{"q": "*:*"}'
{"total_hits_relation":"EQUAL_TO","total_hits":10,"ranges":null,"hits":[{"order":[{"value":1.0,"@type":"float"},{"value":"11a7c29def2c4304a97db812521bd82c","@type":"string"}],"id":"11a7c29def2c4304a97db812521bd82c","fields":{"lastname":"Administrator","givenname":"Setup","email":"[email protected]"}},{"order":[{"value":1.0,"@type":"float"},{"value":"2a7cedcf38e24ebbade7a23f3f07f793","@type":"string"}],"id":"2a7cedcf38e24ebbade7a23f3f07f793","fields":{"lastname":"Clearing2","givenname":"Test","email":"[email protected]"}}],"counts":null,"bookmark":"W1t7InZhbHVlIjoxLjAsIkB0eXBlIjoiZmxvYXQifSx7InZhbHVlIjoiZWIzM2U2ZGI1YTE1NDAxNjgxMDg4OWQ4ZTU0NWZmODIiLCJAdHlwZSI6InN0cmluZyJ9XSxbeyJ2YWx1ZSI6MS4wLCJAdHlwZSI6ImZsb2F0In0seyJ2YWx1ZSI6ImVmMjMxYjQ5NTk3ZDRiZDViMmI4OThkNjcxODIwY2U3IiwiQHR5cGUiOiJzdHJpbmcifV1d"}
If I query on the email field, I get the expected response:
$ curl --user "admin:admin" 'http://localhost:5984/sw360users/_design/nouveau_user/_nouveau/users' -X POST -H 'Content-Type: application/json' -d '{"q": "email:setup*"}'
{"total_hits_relation":"EQUAL_TO","total_hits":1,"ranges":null,"hits":[{"order":[{"value":1.0,"@type":"float"},{"value":"11a7c29def2c4304a97db812521bd82c","@type":"string"}],"id":"11a7c29def2c4304a97db812521bd82c","fields":{"lastname":"Administrator","givenname":"Setup","email":"[email protected]"}}],"counts":null,"bookmark":"W1t7InZhbHVlIjoxLjAsIkB0eXBlIjoiZmxvYXQifSx7InZhbHVlIjoiMTFhN2MyOWRlZjJjNDMwNGE5N2RiODEyNTIxYmQ4MmMiLCJAdHlwZSI6InN0cmluZyJ9XV0="}
But with the lastname field, I get nothing:
$ curl --user "admin:admin" 'http://localhost:5984/sw360users/_design/nouveau_user/_nouveau/users' -X POST -H 'Content-Type: application/json' -d '{"q": "lastname:Administrator"}'
{"total_hits_relation":"EQUAL_TO","total_hits":0,"ranges":null,"hits":[],"counts":null,"bookmark":"W10="}
I tried multiple times with lastname:admin*, lastname:administrator and lastname:Administrator but got no hits, even with different analyzers. The behavior is the same for the other field, givenname. Querying only works for email, with different Lucene syntax.
Expected Behaviour
I expected to be able to query the index on the other fields as well.
Your Environment
$ curl --user "admin:admin" 'http://localhost:5984'
{"couchdb":"Welcome","version":"3.3.3-29db2df","git_sha":"29db2df","uuid":"8722f4f42d4f2d566be241e6035df095","features":["nouveau","access-ready","partitioned","pluggable-storage-engines","reshard","scheduler"],"vendor":{"name":"The Apache Software Foundation"}}
Nouveau is also configured in default.ini, using the default ./rel/couchdb/etc/nouveau.yaml:
[nouveau]
enable = true
url = http://127.0.0.1:5987
- CouchDB version used:
3.3.3-29db2df
- Browser name and version:
curl 8.2.1 (x86_64-conda-linux-gnu) libcurl/8.2.1 OpenSSL/3.0.10 zlib/1.2.13 libssh2/1.10.0 nghttp2/1.52.0
- Operating system and version:
Ubuntu 22.04.4 LTS
Additional Context
Using counts to aggregate index values works just as expected.
$ curl --user "admin:admin" 'http://localhost:5984/sw360users/_design/nouveau_user/_nouveau/users' -X POST -H 'Content-Type: application/json' -d '{"q": "*:*", "counts": ["lastname"]}'
{"total_hits_relation":"EQUAL_TO","total_hits":10,"ranges":null,"hits":[{"order":[{"value":1.0,"@type":"float"},{"value":"11a7c29def2c4304a97db812521bd82c","@type":"string"}],"id":"11a7c29def2c4304a97db812521bd82c","fields":{"lastname":"Administrator","givenname":"Setup","email":"[email protected]"}},{"order":[{"value":1.0,"@type":"float"},{"value":"2a7cedcf38e24ebbade7a23f3f07f793","@type":"string"}],"id":"2a7cedcf38e24ebbade7a23f3f07f793","fields":{"lastname":"Clearing2","givenname":"Test","email":"[email protected]"}}],"counts":{"lastname":{"User2":1,"User1":1,"User":1,"Clearing2":1,"Clearing1":1,"Clearing":1,"Administrator":1,"Admin2":1,"Admin1":1,"Admin":1}},"bookmark":"W1t7InZhbHVlIjoxLjAsIkB0eXBlIjoiZmxvYXQifSx7InZhbHVlIjoiZWIzM2U2ZGI1YTE1NDAxNjgxMDg4OWQ4ZTU0NWZmODIiLCJAdHlwZSI6InN0cmluZyJ9XSxbeyJ2YWx1ZSI6MS4wLCJAdHlwZSI6ImZsb2F0In0seyJ2YWx1ZSI6ImVmMjMxYjQ5NTk3ZDRiZDViMmI4OThkNjcxODIwY2U3IiwiQHR5cGUiOiJzdHJpbmcifV1d"}
$ curl --user "admin:admin" 'http://localhost:5984/sw360users/_design/nouveau_user/_nouveau_info/users'
{"name":"_design/nouveau_user/users","search_index":{"update_seq":628,"purge_seq":0,"num_docs":10,"disk_size":8521}}
thank you for the detailed report, I will look into it.
My first thought is that the query parser is transforming your "Administrator" to "administrator", but as it was indexed as "string" and not "text" it is held as "Administrator" in the index itself, and thus doesn't match.
assuming that's it then I agree that the query parser should not do this for string fields and I will make a fix.
But I tried to query with "admin", "administrator" and "Administrator" without any luck. The same holds for other values like "User". So I suspect that the analyzed value from the index and the analyzed value from the query are not the same. Email is a special case: the analyzer does not change the value, so it matches.
Yes, I mean in the index it is "Administrator" but you are not able to query with the "A" because the query parser converts the term with the standard analyzer. You might say 'q=foo:Administrator' but the query parser is making it a term query on "administrator".
e.g., if you specified the "keyword" analyzer for the lastname field, the query parser won't lowercase it for you and it should then match.
https://lucene.apache.org/core/9_10_0/core/org/apache/lucene/document/StringField.html vs https://lucene.apache.org/core/9_10_0/core/org/apache/lucene/document/TextField.html btw
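The mismatch described in this comment can be illustrated with a deliberately simplified model. This is a sketch only, not nouveau's or Lucene's code: the analyzer is reduced to lowercasing (real analyzers also tokenize, stem and so on), and the helper names `analyze`, `indexTerm` and `matches` are made up for the illustration.

```javascript
// Toy model only: NOT nouveau's or Lucene's actual implementation.
// The analyzer is reduced to lowercasing for the sake of the illustration.
const analyze = (value) => value.toLowerCase();

// "string" fields are stored verbatim; "text" fields go through the analyzer.
function indexTerm(type, value) {
  return type === "text" ? analyze(value) : value;
}

// The query parser analyzes the query term regardless of the field type,
// which is the behaviour described in the comment above.
function matches(indexedTerm, queryTerm) {
  return indexedTerm === analyze(queryTerm);
}

const asString = indexTerm("string", "Administrator");
const asText = indexTerm("text", "Administrator");

console.log(matches(asString, "Administrator")); // false
console.log(matches(asText, "Administrator")); // true
```

In this model a "string" field only matches when the stored value is already lowercase, while a "text" field matches regardless of the case used in the query.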
Can confirm that's the case. I modified the doc to have "lastname": "administrator", then the query q=lastname:admin* gave back the expected result.
I also modified the index to be of type text rather than string, as: index("text", "lastname", doc.lastname, {"store": true}); and the query started working even for uppercase "Admin1".
If that's expected behavior, I can update the index creation doc.
"text" type means the value is analyzed. the Lucene analyzers typically force to lower case among other effects, which explains the new success.
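For reference, the index function from the original report might look like this after switching the name fields to "text". It is a sketch under assumptions: only the lastname change is confirmed in this thread, so applying it to givenname as well is a guess, and the `index` stub and the sample e-mail address exist only so the function can be run outside CouchDB.

```javascript
// Stub standing in for the index() function that nouveau provides to index
// functions. It only records calls so the sketch runs outside CouchDB.
const calls = [];
function index(type, name, value, options) {
  calls.push({ type, name, value, options });
}

// Index function as it might be stored (as a string) in the design document.
const indexUser = function (doc) {
  if (doc.type !== "user") return;
  if (typeof doc.givenname === "string") {
    // "text": analyzed, so case-insensitive search becomes possible.
    index("text", "givenname", doc.givenname, { store: true });
  }
  if (typeof doc.lastname === "string") {
    index("text", "lastname", doc.lastname, { store: true });
  }
  if (typeof doc.email === "string") {
    // Left as "string", as in the original report.
    index("string", "email", doc.email, { store: true });
  }
};

// Hypothetical sample document, for illustration only.
indexUser({
  type: "user",
  givenname: "Setup",
  lastname: "Administrator",
  email: "setup@example.invalid",
});
console.log(calls.map((c) => `${c.type}:${c.name}`));
```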
Yeah, I came to understand that from the (archived) couchdb-lucene project.
@rnewson, would you still like to work on the lower-casing of string queries (I can also try to look into the issue)? Or should I close the issue, since I was using the wrong field type?
I still intend to make an enhancement. Assuming I'm right in my first comment you did nothing wrong, and I would like nouveau to do the right thing.
We know that "string" fields will not be analyzed, we need to tell the query parser to also not analyze the query string for "string" fields (and nouveau knows the index definition, so it does know which fields are "string" or "text", etc).
rehi (I've been out on vacation the last few weeks).
I've mocked up a few approaches to this locally and I don't like any of them, they all either have a non-trivial overhead or other odd side-effects more surprising than what you've encountered.
I think the right move is to clarify that if you index with type "string" and you intend to search on that field (as opposed to only sorting on it, for example), then you need to specify the "keyword" analyzer for that field in the index definition. If you do that, everything works out nicely.
In your case I think you actually do want the "text" type for "lastname", so that you can search case-insensitively, but only you know for sure.
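A design document following the "string" plus "keyword" recommendation could be assembled roughly as below. This is a sketch, not a tested configuration: the database, design document and field names are reused from the original report, and building the document in JavaScript with `toString()` is just one convenient way to produce the stringified index function.

```javascript
// Index function kept as type "string", as in the original report.
// It is only converted to source text here, never called.
const indexLastname = function (doc) {
  if (doc.type === "user" && typeof doc.lastname === "string") {
    index("string", "lastname", doc.lastname, { store: true });
  }
};

const ddoc = {
  _id: "_design/nouveau_user",
  nouveau: {
    users: {
      default_analyzer: "english",
      // The "keyword" analyzer keeps the query term as typed for this field,
      // so it can match the verbatim value stored by the "string" type.
      field_analyzers: { lastname: "keyword" },
      index: indexLastname.toString(),
    },
  },
};

console.log(JSON.stringify(ddoc, null, 2));
```

With this definition the search stays case-sensitive: a query would have to use "Administrator" exactly as stored.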
Hey, thanks for the updates. The documentation makes the type field more clear in #5018.
no problem!
@rnewson Even with the suggestions from #5018, querying fails in a different scenario.
I have a document with the fields name and version. Version is stored as a string in CouchDB and sample values are "1", "2", "4.2.0". With the default analyzer (simple_asciifolding), these values were getting lost:
$ curl --user "admin:admin" 'http://localhost:5984/_nouveau_analyze' -X POST -H 'Content-Type: application/json' -d '{"analyzer": "simple_asciifolding", "text": "4.2.0"}' | jq
{
"tokens": []
}
Thus I created the index as following:
{
"_id": "_design/lucene",
"_rev": "238-6e02d3801cc64311f5244cb242855e82",
"nouveau": {
"projects": {
"default_analyzer": "keyword",
"field_analyzers": {
"version": "keyword"
},
"index": "function(doc) {\nif(doc.version !== undefined && doc.version != null && doc.version.length >0) {\n index('text', 'version', doc.version, {'store': true});\n }\n}"
}
}
}
Notice I set field_analyzers for the version field to keyword, as #5018 suggested. I also tried the field type string as well as text. But in all cases, I cannot find the document with the queries "version:4.2.0" or "version:1".
You specify the keyword analyzer for all fields. For input "4.2.0" that tokenizes to "4.2.0".
Can you show the result of querying the index with ?q=version:4.2.0, ?q=version:"4.2.0" and finally a ?q=_id: with the doc id of the doc with a version of 4.2.0 (e.g., ?q=_id:doc1).
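The three requested queries can be prepared programmatically, which also takes care of escaping the quotes in the JSON body. The helper below is hypothetical (`nouveauRequest` is not part of any CouchDB client); the host, database, design document and index names are the ones used in this thread, and authentication is left out.

```javascript
// Hypothetical helper: builds the URL and fetch() options for a nouveau query.
// The endpoint shape follows the curl commands shown in this thread.
function nouveauRequest(base, db, ddoc, indexName, q) {
  return {
    url: `${base}/${db}/_design/${ddoc}/_nouveau/${indexName}`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      // JSON.stringify escapes the inner double quotes of phrase queries.
      body: JSON.stringify({ q }),
    },
  };
}

// "_id:doc1" is the placeholder id from the comment above.
const queries = ["version:4.2.0", 'version:"4.2.0"', "_id:doc1"];
const requests = queries.map((q) =>
  nouveauRequest("http://localhost:5984", "sw360db", "lucene", "projects", q)
);

for (const r of requests) {
  console.log(r.url, r.init.body);
}
// To send one: await fetch(r.url, r.init), with credentials added as needed.
```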
Here are the outputs as requested. I am getting same results for GET and POST queries.
Output of analyze:
$ curl --user "admin:admin" --silent 'http://localhost:5984/_nouveau_analyze' -X POST -H 'Content-Type: application/json' -d '{"analyzer": "keyword", "text": "4.2.0"}' | jq
{
"tokens": [
"4.2.0"
]
}
Output of version:4.2.0:
$ curl --user "admin:admin" --silent 'http://localhost:5984/sw360db/_design/lucene/_nouveau/projects' -X POST -H 'Content-Type: application/json' -d '{"q": "version:4.2.0"}' | jq
{
"total_hits_relation": "EQUAL_TO",
"total_hits": 0,
"ranges": null,
"hits": [],
"counts": null,
"bookmark": "W10="
}
Output of version:"4.2.0":
$ curl --user "admin:admin" --silent 'http://localhost:5984/sw360db/_design/lucene/_nouveau/projects' -X POST -H 'Content-Type: application/json' -d '{"q": "version:\"4.2.0\""}' | jq
{
"total_hits_relation": "EQUAL_TO",
"total_hits": 0,
"ranges": null,
"hits": [],
"counts": null,
"bookmark": "W10="
}
Output with the doc match:
$ curl --user "admin:admin" --silent 'http://localhost:5984/sw360db/_design/lucene/_nouveau/projects' -X POST -H 'Content-Type: application/json' -d '{"q": "_id:c26ccf0179c14f22bf6f9d0d55acd2dc"}' | jq
{
"total_hits_relation": "EQUAL_TO",
"total_hits": 1,
"ranges": null,
"hits": [
{
"order": [
{
"value": 0.44583148,
"@type": "float"
},
{
"value": "c26ccf0179c14f22bf6f9d0d55acd2dc",
"@type": "string"
}
],
"id": "c26ccf0179c14f22bf6f9d0d55acd2dc",
"fields": {
"version": "4.2.0",
"state": "ACTIVE",
"projectType": "CUSTOMER",
"name": "fossology",
"clearingState": "IN_PROGRESS",
"businessUnit": "DEPARTMENT"
}
}
],
"counts": null,
"bookmark": "W1t7InZhbHVlIjowLjQ0NTgzMTQ4LCJAdHlwZSI6ImZsb2F0In0seyJ2YWx1ZSI6ImMyNmNjZjAxNzljMTRmMjJiZjZmOWQwZDU1YWNkMmRjIiwiQHR5cGUiOiJzdHJpbmcifV1d"
}
thanks.
ok, the short answer is that the (nouveau-specific) query parser interprets "4.2.0" as a number and performs a numeric query, not a text/string query. I'm surprised by that, but obviously the same would be true for "4", etc.
This is a very helpful thread btw, these are exactly the issues I want to confront before removing the 'experimental' label from nouveau.
jshell> var nf = NumberFormat.getInstance(Locale.getDefault());
nf ==> java.text.DecimalFormat@674dc
jshell> nf.parse("4");
$5 ==> 4
jshell> nf.parse("4.2");
$6 ==> 4.2
jshell> nf.parse("4.2.0");
$7 ==> 4.2
That's core Java behaviour.
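The same kind of prefix parsing exists in JavaScript, which makes for a quick comparison. This is an analogy only, not nouveau's code and not the fix that was later merged: `parseFloat` stops at the first character it cannot use, much like the NumberFormat.parse calls in the jshell session, while `Number` rejects the string unless all of it is numeric. The helper names are made up for the illustration.

```javascript
// Analogy in JavaScript, not nouveau's code.
// Lenient: parses the longest numeric prefix and ignores the rest.
const lenient = (s) => parseFloat(s);

// Stricter: the entire string must be a finite number.
const isEntirelyNumeric = (s) => s.trim() !== "" && Number.isFinite(Number(s));

for (const s of ["4", "4.2", "4.2.0"]) {
  console.log(s, lenient(s), isEntirelyNumeric(s));
}
```

A version string such as "4.2.0" passes the lenient parse (as 4.2) but fails the strict check, which is why lenient number detection turns it into a numeric query.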
BTW for context, I am translating the project sw360 which currently uses couchdb-lucene to nouveau.
that's helpful to know, thanks. I'm looking at changing the "magical" nature of numeric queries. I extended/altered the basic lucene query syntax to auto-detect numbers but it has always been a bit awkward (as you've re-discovered).
so I'm looking at a syntax extension that lets you tell nouveau that you intend to look for "2" as a string or as a number, explicitly.
posted a draft PR that addresses this, with some extensive prose on whether it's a good idea or not.
Would it make sense to use the field type of the index? We already have the types double, string and text to differentiate the types of values. (just guessing)
@GMishx I merged a fix for this, but note that I had to change how some things work (you can see the documentation diff in https://github.com/apache/couchdb/pull/5021). Essentially you don't need to put a type indicator at the end of the field name when sorting.
what should now happen is you can index a field as a number or a string and the right kind of query will be used. Please give it a try.
I can confirm that indexing and querying now work as expected for the reported issue. Thanks for the quick fix @rnewson. I can index and query the values "4.2.0", "1" and "2".
I will test it further with other values as well and update here.
thanks for the confirmation, I like this change and your issue was the nudge I needed to make this improvement, so thank you.