couchdb icon indicating copy to clipboard operation
couchdb copied to clipboard

feat(`mango`): strict index selection

Open pgj opened this issue 2 years ago • 59 comments
trafficstars

It is not always beneficial for the performance if the Mango query planner tries to assign an index to the selector. User-specified indexes may save the day, but since they are only hints for the planner, automated overrides can still happen.

Introduce the concept of "strict index selection" (h/t @willholley) which lets the user to request the exclusive use of a specific index. When it is not possible, give up on planning and return an HTTP 400 response right away. This way the user has the chance to learn about the missing index, request its creation and try again later.

The feature comes with a configuration toggle. By default, the feature is disabled to maintain backward compatibility, but the user may ask for this behavior via the new use_index_strict query parameter for a specific query. Note that, similarly, use_index_strict could be used to disable strict index selection temporarily when it is enabled on the server.

Fixes #4511

Checklist for the review

  • [x] Code is written and works correctly
  • [x] Changes are covered by tests
  • [x] Any new configurable parameters are documented in rel/overlay/etc/default.ini
  • [x] Documentation changes were made in the src/docs folder

pgj avatar Jul 30 '23 15:07 pgj

/$dbname/_design/$ddoc/_find/$indexname

we already have use_index as the in-JSON way to specify the index. afaict, this PR just makes it mandatory.

I have qualms with adding the URL-way on top of this and later possibly deprecating use_index, but this does not have to hold this particular PR for me.

janl avatar Aug 01 '23 11:08 janl

I agree with @janl. I would be happy to work on making the Mango interface to be consistent with that of Search (I like the idea), but not in the scope of this PR.

pgj avatar Aug 01 '23 11:08 pgj

I disagree. The mistake when use_index was added, where it was allowed to not use the index requested, is embedded now. Having a "I really mean it" option, that is optional, and off by default, makes this interface worse, not better.

We should either fix the associated issue (prevent use_index from not using any index other than the one requested) or not. An optional fudge is worse.

rnewson avatar Aug 01 '23 12:08 rnewson

I see. But we should keep use_index around as it is now for a while for backward compatibility, right?

pgj avatar Aug 01 '23 13:08 pgj

I see. But we should keep use_index around as it is now for a while for backward compatibility, right?

Yes.

An optional fudge is worse.

I wouldn't frame it these strict terms. I see this PR as a step towards making the Mango-query-index-selection API look similar to JS views and search APIs. IF this PR doesn't get us the whole step there, that’s fine by me as long as we arrive there eventually. IF we want to fix this all in one go, that’s fine by me, but for BC reasons we will need a toggle of some sort for a while and time for folks to migrate off of the no longer preferred APIs.

janl avatar Aug 01 '23 13:08 janl

the behaviour of use_index can't easily be changed (or, I think we'd all agree, fixed), short of a major version bump.

If there was a roadmap here for couchdb 4.0 (handwaving, but imagine that use_index as an option goes away entirely. you either use /dbname/_find and let couchdb choose, or you use /path/to/specific/index/_find to choose for yourself).

Without that, it's hard to decide whether adding more options is taking us somewhere better or just somewhere more confusing. As a user I would be legitimately baffled to be told I should use use_index_strict when I report that my use_index directive is being ignored.

I'm not sure if I'm -0 or -1 on this, I'm veering to -0 though. I don't like leaving use_index as-is (it should never have been merely a hint) and I don't like adding another argument to do what use_index should have done.

Hence I suggested having _find also reachable from a url that clearly is associated with a specific index resource, like we have for all the other kinds of index we have. That endpoint can return a 400 if the request parameters cannot be executed by the index.

That idea also gives users a migration path away from the flawed choosing algorithm. A future (major) release of CouchDB might then drop /dbname/_find entirely.

rnewson avatar Aug 01 '23 13:08 rnewson

In general, for API migrations we loosely agreed on: introduce a better API, eventually retire the old one, maybe.

In that light, adding the REST-URL-selects-index route is adding the better API and we can later remove raw _find when we think the time is right.

That is indeed cleaner than adding another toggle to change the behaviour of use_index.

I have no problems with that approach.

janl avatar Aug 01 '23 13:08 janl

@janl I don't see how this is a step towards "making the Mango-query-index-selection API look similar to JS views and search APIs". my suggestion of adding /path/to/specific/index/_find does that, but expanding /dbname/_find does not. Can you explain the step you see?

rnewson avatar Aug 01 '23 13:08 rnewson

@janl we commented concurrently. I agree with your last. :)

rnewson avatar Aug 01 '23 13:08 rnewson

noting use_index should be prohibited (400 Bad Request) for the /path/to/index/_find variant

rnewson avatar Aug 01 '23 13:08 rnewson

That idea also gives users a migration path away from the flawed choosing algorithm. A future (major) release of CouchDB might then drop /dbname/_find entirely.

Why to drop /{db}/find entirely? Why not to fix the index selection algorithm instead? I understand that it might be a nuisance for the users that a less optimal index is selected, but that is why I started to work on #4662 for example.

pgj avatar Aug 01 '23 14:08 pgj

Why to drop /{db}/find entirely?

it was a mistake to allow unindexed queries. We are trying to reverse that mistake now :)

janl avatar Aug 01 '23 14:08 janl

A revisited version of /{db}/find may refuse to run unindexed queries, it does not have to be necessarily removed because currently it makes that possible. Or am I missing something here?

pgj avatar Aug 01 '23 14:08 pgj

sure, but why leave it if we have a better API in place?

janl avatar Aug 01 '23 14:08 janl

I am not sure I got you. In my understanding, /{db}/find provides the possibility of letting CouchDB choose the index for the query. Removing this endpoint would lose this feature. It is hard for me to compare this with explicitly naming the index in the URL — there is no "better" or "worse" API in that sense, I believe. These are two different answers to the same problem: manual vs. automated index selection.

pgj avatar Aug 01 '23 14:08 pgj

removing /dbname/_find is only a possibility (that could only happen in a major version bump). I mention it as part of framing any improvement here. Adding the index-specific_find is the couchdb-shaped way to achieve "strict index selection", as that's how it's done for map/reduce and search indexes today. If, in the future, we find that /dbname/_find is much less used / actively avoided / a source of confusion not joy, we could deprecate it.

rnewson avatar Aug 01 '23 14:08 rnewson

More advanced query engines allow selecting multiple indices to answer a query. It's not impossible, for Mango to be improved to do it as well. For instance, one pain point of Mango is it can't use indices for general disjunctions today; but it might be possible to do that by doing a merge on two indices. If the new Mango query API is to specify only one index, that API choice will work against that improvement.

Another, more pessimistic view, I guess is to say Mango is never going to be anything but a layer on top of an mrview index and we shouldn't pretend otherwise. Then selecting an index and forcing that there be an index defined is probably reasonable.

nickva avatar Aug 01 '23 15:08 nickva

I agree with Nick that we would not want to put ourselves in the corner by embedding assumption about single index into endpoint name.

I also agree with Bob and Jan that adding more option to existing endpoint is not the best choice in this particular case. Specifically because the option would be off by default for compatibility reasons. Also adding new options to existing API endpoints complicates maintaining of client libraries.

I think we should come up with a new endpoint name but shouldn't embed index name into it.

I am bad at naming and naming is hard.

/{dbname}/_find_indexed /{dbname}/_select - // inspired by SQL /{dbname}/_find_select - // find using selection of indexes

Let's ask help from @ricellis who has an extensive experience in designing APIs.

The decision whether to keep _find in the future or remove it is out of the scope for this PR.

iilyak avatar Aug 01 '23 19:08 iilyak

The stated purpose of this PR is to allow a user to run a selector against a specific individual index or receive an error stating that the index is not compatible, rather than fallback to a less efficient index.

Let us focus on the best way to achieve that specific goal, and whether to then do so.

Speculation about the future of /dbname/_find has muddied the waters. I am sorry for my part in that.

For map/reduce views and search indexes (either dreyfus/clouseau or nouveau) this is done having the path to the index in the url.

Inventing a third mechanism (as Ilya proposes) for this is not acceptable to me; our API is already complex enough.

If we want a facility to run a selector against a user-selected index, then I only see one way: /db/_design/design-doc/_find/index-name.

If we don't want that, then we need do nothing.

rnewson avatar Aug 01 '23 19:08 rnewson

If we don't want that, then we need do nothing.

I don't have that strong of opinion. My only strong opinion is to involve Rich and his team in the design.

iilyak avatar Aug 01 '23 20:08 iilyak

Agreed, would be good to hear from Rich et al

rnewson avatar Aug 01 '23 21:08 rnewson

Appreciating the difficulties with augmenting existing endpoints, I think it's worth considering adding "selector" and "fields" support to the existing _all_docs, view and search endpoints. This would be consistent with the _changes and _find endpoints and provides most of the benefits of a single-index _find query, albeit with a more explicit API.

It would also enable the filtering functionality for existing view/search indexes whereas Mango makes assumptions about the index key structure that hand-crafted indexes would need to replicate if _find we're to be supported for them.

On Tue, 1 Aug 2023, 23:26 Robert Newson, @.***> wrote:

Agreed, would be good to hear from Rich et al

— Reply to this email directly, view it on GitHub https://github.com/apache/couchdb/pull/4710#issuecomment-1661123039, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAX365O3ZXNF4RRRIJZHXLXTFYBHANCNFSM6AAAAAA25IW4QM . You are receiving this because you were mentioned.Message ID: @.***>

willholley avatar Aug 02 '23 07:08 willholley

oooh I like that.

rnewson avatar Aug 02 '23 08:08 rnewson

one further note here. when, in the future, you pass a selector to _view or _search we should send a 400 Bad Request if mango would do a startkey/endkey that covers the entire index (for _view that's when it does "start_key": [], "end_key": ["<MAX>"],.)

rnewson avatar Aug 22 '23 09:08 rnewson

Okay, thanks for the note @rnewson.

pgj avatar Aug 22 '23 09:08 pgj

perhaps it's more general, the goal being that query execution through these new options is always efficient (at least relative to reading the entire database or index). This is an opportunity to reject things that we previously had to accept.

rnewson avatar Aug 22 '23 09:08 rnewson

Apologies for the delay in commenting here, I missed it initially in vacation season.

Appreciating the difficulties with augmenting existing endpoints, I think it's worth considering adding "selector" and "fields" support to the existing _all_docs, view and search endpoints.

I can see the appeal of this suggestion as a neat way to direct mango-style queries to a specific index without the need to add additional endpoints. However, the problem I see is that the response structure for view-like things, searches and mango queries are all a bit different.

I'd be completely opposed to different response structures being returned depending on the request properties. For example IMO a call to _all_docs with a selector should definitely not return an array of "docs":[{…},…] like _find does because it does not conform with the normally expected pattern for _all_docs repsonses of "rows":[{…"doc":{…}},…]. That kind of response structure toggling makes using endpoints unwieldy and confusing for users and also makes them hard to describe accurately and correctly in tooling and documentation.

To go down that route I'd say the responses would need to be transformed (or maybe it is actually not transformed) from the underlying index response to retain the existing structures used by the endpoints being augmented. That might in turn result in a loss of fidelity for some things and although additional response properties could be added relatively painlessly a proliferation of additional response properties that are only populated under certain request conditions is also not very user friendly (e.g. adding a bookmark that is only populated with a selector but not on a normal _all_docs, or execution_stats etc).

Further I think consideration of the impact on the request options is also needed, like how combinations could (or could appear to) operate. For example: what would it mean to provide a selector to a view-like endpoint with include_docs=false? or request fields=["foo","bar"]&include_docs=true on a normal view-like request?

The more APIs are dual-purposed and additional options bolted onto existing endpoints the more potential combinations surface that seem to make sense to users, but are invalid in the implementation. Again this leads to APIs that are harder to document and use. It isn't great for usability to have to fathom a giant matrix of option combinations from the manual. Sure validation can be added, but even then potentially making multiple requests getting 400 responses before landing on a valid combination isn't a good experience.

ricellis avatar Sep 12 '23 11:09 ricellis

@ricellis this is excellent feedback. I agree with your point about the several dimensions of confusion that we could introduce here, and our existing querying and indexing system is already confused. I don't want to make it worse.

It is no secret that I think mango was a mistake in its entirety and this colours my response to any attempt to change it. I dislike it, I think, for objective reasons. It promises much and delivers little, and every user of it that accidentally ends up doing full database reads on every query because they, quite reasonably, believed they were using a query interface, is a user we have harmed.

It is in that spirit that I objected to the original idea of this PR, to add another request body option to mean "I really mean you must use this index" because use_index is only advisory, despite its confident-sounding name.

However, adding selector as an option to _view and _search has problems, as you've noted.

So, I think we need another way.

I very much dislike the path of _find at the db level. The server side is insufficiently powerful today and, quite likely, will be forever, it is inappropriate to suggest that something clever happens on the server side. All that happens is the least-worst index for your selector is chosen and the rest is brute-forced. The diligence required by users to ensure they hit an index is high, and frequently not met.

Rather than an additional request body parameter to /db/_find, perhaps we could add the index to the url path, but still end in _find? The request and response body stay in 'find' format (barring any fields that might be obsolete by the fact the index is known already).

When called as /$dbname/_design/$ddoc/_find/$indexname we would only execute the query if the selector matches the index's indexed fields. a 400 Bad Request is returned for anything else. No _all_docs fallback. Quite what we would decide on middle cases where a result can be served from the chosen index with some post-query filtering I'm not sure. I'd prefer to not allow that either.

In this formulation I think 'execution stats' might not be emitted, since the number of rows returned would always match the number examined (as they've always done with _view).

Smaller decisions like either prohibiting use_index in the request body or accepting it if it matches (but definitely 400 Bad Request if present and different) also need to be made.

rnewson avatar Sep 12 '23 13:09 rnewson

Rather than an additional request body parameter to /db/_find, perhaps we could add the index to the url path, but still end in _find? The request and response body stay in 'find' format (barring any fields that might be obsolete by the fact the index is known already).

This could work, though only for indexes created via the _index endpoint given we need the index definition in a format that allows _find to reason with the schema. For JS/other indexes we could return a 400 status code as we would for an invalid index.

In this formulation I think 'execution stats' might not be emitted, since the number of rows returned would always match the number examined (as they've always done with _view).

I don't think this is true unless you're proposing to limit _find to start_key/end_key derivation only, which would represent a large step down in functionality from the database-level _find?

willholley avatar Sep 12 '23 15:09 willholley

definitely _find will only work for indexes created with _index for the reasons you state.

I think you're right that, in practice, the index-level _find will have to allow the 'middle' cases. e.g an mr-backed index on two fields where the selector is only on the first field, so a wider start/endkey (["foo"] to ["foo",{}]) and then filter out the non-matches.

That's still a lot better than db-level _finds failure modes. We wouldn't allow a startkey/endkey that covers the entire view (or the *:* equivalent for search, though that at least doesn't return all the hits).

rnewson avatar Sep 12 '23 17:09 rnewson