couchdb
couchdb copied to clipboard
Mango queries should not use other index when use_index is specified
Description
When a mango query includes a use_index
option and the specified index does not exist, CouchDB might fallback on an existing index, which can cause unexpected behavior.
Steps to Reproduce
Consider the following mango query:
{
"selector": {
"worker": "service",
"$or": [
{
"state": "queued"
},
{
"state": "running"
}
]
},
"sort": [
{
"worker": "asc"
}
],
"use_index": "_design/by-worker"
}
Now suppose the _design/by-worker
index does not exist, but there is an existing one on the same fields:
{
"index": {
"fields": ["worker", "state"]
},
"ddoc": "_design/by-worker-and-state",
"type" : "json"
}
The _design/by-worker-and-state
index will actually be used. However, this can be problematic if the database is quite huge: the $or
will be evaluated in-memory, which could result in timeouts as the whole database will be scanned.
In our client logic, we first run the mango query and expect an error, if the index does not exist. Then, we create the index, and run the query again. This avoids to check if the index exists for every mango query, as it will be the case most of the time. See the logic here.
In our example, if the index does not exist, we would create one with a partial filter:
{
"index": {
"fields": ["worker"],
"partial_filter_selector": {
"$or": [
{
"state": "queued"
},
{
"state": "running"
}
]
}
},
"ddoc": "_design/by-worker-and-partial-state",
"type" : "json"
}
This avoids to evaluate the $or
at query time, and potential timeouts. But because of the existing index being used, we never have the opportunity to create the actual and efficient index.
Expected Behaviour
We expect any mango query including a use_index
to immediately fails if the index does not exist. This seems reasonable to force the use of a particular index without any fallback, as it might lead to sub-optimal queries.
Your Environment
- CouchDB version used: 3.2.1
- Browser name and version: Not relevant
- Operating system and version: Not relevant
Additional Context
This issue can be convoluted by specifying a sort on fields that are not used on the existing index, but this is not really a solution for us.
I would like to work on this ticket. Committers, please assign it to me.
Hi @paultranvan, I have studied the source code and it seems this behavior is intended (see #962). However, I see what you want to achieve here and why this behavior causes a problem.
On the side of CouchDB, a possible solution would be to adjust the API so that the desired semantics could be requested. That is, by passing, say use_index={"candidates": ["_design/by_worker"], "strict": true}
will result HTTP 400 (Bad Parameters) when the requested index is missing.
But I think the whole issue could be managed on the client side. Why not to store the list of known indexes there and use it to decide on each query if an index has been in use? This might be even faster as you would not have to look out for errors constantly but get the desired result by a single query. When needed, the list of known indexes could be initialized / synced from a CouchDB /{db}/_index
GET
request.
Hi @pgj, thank you for your answer.
Yes it could be handled by the client, but it is quite cumbersome because you have to deal with this list state and update it when there are add/edit/remove on indexes. Typically, if an index is not in the list, you don't know if it's because it does not exist yet or simply because the list is not up-to-date, which could happen if there are several clients.
That's why I prefer the idea of adjusting the API e.g. with a strict: true
param :)
I like the idea of an optional behaviour to require strict index usage, and probably should have a toggle for the default behaviour in default.ini
as well as an API option. I worry a bit about overloading the type in use_index
as I know this causes problems with some clients as well as being difficult to specify in OpenAPI, though I know it is already a bit confused in allowing either a string or array. A distinct strict_index_selection
or similar field would likely be simpler in that regard, and I wonder whether we might want to use it to toggle index fall-back behaviour more generally (i.e. error instead of using all_docs
if no index is available to service a query)?
I proposed the API option because that can be controlled per query not per deployment. My impression is that users are mostly fine with the original behavior but there are use cases (like the one featured in this issue) where "more strictness" would be preferred. Nevertheless I can be fine with controlling the default from default.ini
and let it be overridden per-query, which could then work in both directions.
I fully accept the criticism about overloading use_index
even more. Adding a separate flag is indeed a cleaner approach. Extending the approach to the generic case can also be a good move -- similarly to this use case, there may be others where it is not preferred to execute suboptimal queries but return an error.