Stricter failure handling in `TransportGetSnapshotsAction`
Today if there's a failure during a multi-repo get-snapshots request then we record a per-repository failure but allow the rest of the request to proceed. This is trappy for clients: it means they must always remember to check the `failures` response field or else risk missing some results. It's also a pain for the implementation because it means we have to collect the per-repository results separately first before adding them to the final results set, just in case the last one triggers a failure.
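To illustrate the trap, here is a minimal sketch of the defensive pattern the lenient behaviour forces on clients. This is hypothetical client code, not from Elasticsearch itself; the response shape matches the example later in this thread, and `assert_complete` is an invented helper name.

```python
def assert_complete(response: dict) -> list:
    """Return the snapshot list, refusing to silently drop failed repositories."""
    failures = response.get("failures", {})
    if failures:
        # Without this check, a 200 OK response can silently omit whole repositories.
        details = ", ".join(f"{repo}: {err['type']}" for repo, err in failures.items())
        raise RuntimeError(f"partial get-snapshots failure: {details}")
    return response["snapshots"]

# A partial-failure response: some snapshots returned, some repositories failed.
partial = {
    "snapshots": [{"snapshot": "snap-1", "repository": "repo0"}],
    "failures": {"repo3": {"type": "snapshot_missing_exception",
                           "reason": "[repo3:snap-2] is missing"}},
}

try:
    assert_complete(partial)
except RuntimeError as e:
    print(e)  # the partial failure is surfaced instead of being missed
```

A client that forgets this check quietly gets incomplete results, which is the trap this PR removes.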
This commit drops this leniency and bubbles all failures straight up to the top level.
Pinging @elastic/es-distributed (Team:Distributed)
Hi @DaveCTurner, I've created a changelog YAML for you. Note that since this PR is labelled `>breaking`, you need to update the changelog YAML to fill out the extended information sections.
I think this is a design bug which was an artefact of the original multi-repo implementation in #42090 which we've preserved throughout future changes. AFAICT there's no good reason for being so lenient here. Moreover today's behaviour is an obstacle to improving the performance (https://github.com/elastic/elasticsearch/issues/95345) and memory usage (https://github.com/elastic/elasticsearch/issues/104607) of this API in clusters with high snapshot counts.
To be precise, this changes a few cases from partial failures into total failures:
- the user requests snapshots in some collection of repositories which includes a specific named repository that is not registered with the cluster. Previously we'd list all the other repositories and mark the specific named repository with a `RepositoryMissingException`. Now the whole request fails with a `RepositoryMissingException`. This makes sense to me: the user should use the get-repositories API to determine what repositories are registered with the cluster rather than using this API.
- one of the repositories we're listing is so broken that we cannot read its `RepositoryData`. Again, previously we'd list all the other repositories and skip the broken one. Now the whole request fails. Again, this makes sense to me: the user can exclude the broken repository from the request if desired.
- one of the target snapshots has an unreadable `SnapshotInfo` blob. Previously if `?ignore_unavailable=false` we'd skip the whole repository and return incomplete results, whereas now we fail the whole request with a `SnapshotMissingException`. If `?ignore_unavailable=true` we skip the missing snapshot as before.
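All three cases reduce to one change in how per-repository outcomes are aggregated. The following is a hedged Python sketch of that rule change with invented names; the real implementation lives in the Java class `TransportGetSnapshotsAction` and does not look like this.

```python
# repo_results maps a repository name to either a list of snapshots
# or an Exception instance representing that repository's failure.

def aggregate_lenient(repo_results: dict) -> dict:
    """Old behaviour: record per-repository failures and keep going."""
    snapshots, failures = [], {}
    for repo, result in repo_results.items():
        if isinstance(result, Exception):
            failures[repo] = type(result).__name__
        else:
            snapshots.extend(result)
    return {"snapshots": snapshots, "failures": failures}

def aggregate_strict(repo_results: dict) -> dict:
    """New behaviour: any repository-level failure fails the whole request.

    (With ?ignore_unavailable=true, individual missing snapshots are still
    skipped earlier, before results reach this aggregation step.)"""
    snapshots = []
    for result in repo_results.values():
        if isinstance(result, Exception):
            raise result  # bubbles straight up to the top level
        snapshots.extend(result)
    return {"snapshots": snapshots}
```

The strict version also removes the need to buffer per-repository results "just in case" a later repository fails, which is the implementation pain point mentioned above.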
I'm marking this as `>breaking` since it's changing the behaviour of certain failure cases, even though I think we should go ahead with it. Also marking it as `team-discuss` so we remember to talk about it. If the team is ok with the idea then I'll formally propose it as a breaking change.
We (the @elastic/es-distributed coordination subteam) discussed this today and broadly agreed with this direction, pending confirmation from Kibana and Cloud callers that this won't cause them any problems, and then agreement with the breaking changes committee that this change is acceptable.
Hi @DaveCTurner, I've updated the changelog YAML for you. Note that since this PR is labelled `>breaking`, you need to update the changelog YAML to fill out the extended information sections.
Hi @DaveCTurner, I've updated the changelog YAML for you. Note that since this PR is labelled `>breaking`, you need to update the changelog YAML to fill out the extended information sections.
Bad bot 🤖
Hi @DaveCTurner, I've updated the changelog YAML for you. Note that since this PR is labelled `>breaking`, you need to update the changelog YAML to fill out the extended information sections.
To clarify, today a partial failure (with `?ignore_unavailable=false`) when retrieving snapshots from multiple repositories looks like this:
200 OK
{
"snapshots" : [
{
"snapshot" : "cfmqmfrheg",
"uuid" : "Wb7b0_zVRH207wPtuMOy-w",
"repository" : "repo0",
"version_id" : 8505000,
"version" : "8505000",
"indices" : [
"test-idx"
],
"data_streams" : [ ],
"include_global_state" : true,
"state" : "SUCCESS",
"start_time" : "2024-04-22T09:44:36.205Z",
"start_time_in_millis" : 1713779076205,
"end_time" : "2024-04-22T09:44:36.256Z",
"end_time_in_millis" : 1713779076256,
"duration" : "51ms",
"duration_in_millis" : 51,
"failures" : [ ],
"shards" : {
"total" : 4,
"failed" : 0,
"successful" : 4
},
"feature_states" : [ ]
},
{
"snapshot" : "lhbqjriekp",
"uuid" : "R5_AuiaHSEi4J0X2RvTBSw",
"repository" : "repo0",
"version_id" : 8505000,
"version" : "8505000",
"indices" : [
"test-idx"
],
"data_streams" : [ ],
"include_global_state" : true,
"state" : "SUCCESS",
"start_time" : "2024-04-22T09:44:36.346Z",
"start_time_in_millis" : 1713779076346,
"end_time" : "2024-04-22T09:44:36.370Z",
"end_time_in_millis" : 1713779076370,
"duration" : "24ms",
"duration_in_millis" : 24,
"failures" : [ ],
"shards" : {
"total" : 4,
"failed" : 0,
"successful" : 4
},
"feature_states" : [ ]
},
{
"snapshot" : "kqxbwkttzr",
"uuid" : "0xa8mF3tQDaBe77-56jvqg",
"repository" : "repo0",
"version_id" : 8505000,
"version" : "8505000",
"indices" : [
"test-idx"
],
"data_streams" : [ ],
"include_global_state" : true,
"state" : "SUCCESS",
"start_time" : "2024-04-22T09:44:36.462Z",
"start_time_in_millis" : 1713779076462,
"end_time" : "2024-04-22T09:44:36.495Z",
"end_time_in_millis" : 1713779076495,
"duration" : "33ms",
"duration_in_millis" : 33,
"failures" : [ ],
"shards" : {
"total" : 4,
"failed" : 0,
"successful" : 4
},
"feature_states" : [ ]
},
{
"snapshot" : "jfljnrqutw",
"uuid" : "G7zg1e6JQDKRTib6wO4erQ",
"repository" : "repo1",
"version_id" : 8505000,
"version" : "8505000",
"indices" : [
"test-idx"
],
"data_streams" : [ ],
"include_global_state" : true,
"state" : "SUCCESS",
"start_time" : "2024-04-22T09:44:36.674Z",
"start_time_in_millis" : 1713779076674,
"end_time" : "2024-04-22T09:44:36.707Z",
"end_time_in_millis" : 1713779076707,
"duration" : "33ms",
"duration_in_millis" : 33,
"failures" : [ ],
"shards" : {
"total" : 4,
"failed" : 0,
"successful" : 4
},
"feature_states" : [ ]
},
{
"snapshot" : "dnmxscesew",
"uuid" : "_ZIcQBRaSMinWxa0pX8llA",
"repository" : "repo1",
"version_id" : 8505000,
"version" : "8505000",
"indices" : [
"test-idx"
],
"data_streams" : [ ],
"include_global_state" : true,
"state" : "SUCCESS",
"start_time" : "2024-04-22T09:44:36.791Z",
"start_time_in_millis" : 1713779076791,
"end_time" : "2024-04-22T09:44:36.807Z",
"end_time_in_millis" : 1713779076807,
"duration" : "16ms",
"duration_in_millis" : 16,
"failures" : [ ],
"shards" : {
"total" : 4,
"failed" : 0,
"successful" : 4
},
"feature_states" : [ ]
}
],
"failures" : {
"repo3" : {
"type" : "snapshot_missing_exception",
"reason" : "[repo3:jfljnrqutw] is missing"
},
"repo4" : {
"type" : "snapshot_missing_exception",
"reason" : "[repo4:lhbqjriekp] is missing"
}
},
"total" : 5,
"remaining" : 0
}
Note that we get no results from `repo3` or `repo4` in this case. With this change, those `snapshot_missing_exception` exceptions will be pulled up to the top level:
404 Not Found
{
"error" : {
"root_cause" : [
{
"type" : "snapshot_missing_exception",
"reason" : "[repo3:jfljnrqutw] is missing"
}
],
"type" : "snapshot_missing_exception",
"reason" : "[repo3:jfljnrqutw] is missing",
"caused_by" : {
"type" : "some_inner_exception",
"reason" : "File [foo/bar/baz.dat] not found"
}
},
"status" : 404
}
Note that this is already the behaviour if the request targets a single repository; it's only the multi-repo case that supports partial failures.
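For callers, the migration is therefore from "check the `failures` field of a 200 response" to "handle a non-2xx error response". A hedged sketch of both shapes, using the field names from the example responses above (the helper itself is hypothetical):

```python
def snapshots_or_raise(status: int, body: dict) -> list:
    """Handle both the old partial-failure shape and the new top-level error."""
    if status >= 400:
        # New behaviour: failures arrive as a standard top-level error object.
        err = body["error"]
        raise RuntimeError(f"{err['type']}: {err['reason']}")
    # Old behaviour: a 200 response may still carry per-repository failures.
    if body.get("failures"):
        raise RuntimeError(f"partial failure in repositories: {sorted(body['failures'])}")
    return body["snapshots"]

# The new-style total failure, shaped like the 404 response shown above.
new_style = (404, {"error": {"type": "snapshot_missing_exception",
                             "reason": "[repo3:jfljnrqutw] is missing"},
                   "status": 404})
```

Keeping the old-shape check during the transition matters because, as noted later in the thread, mixed-version clusters can still produce the per-repository `failures` field.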
Thanks a lot for the example, @DaveCTurner! I had a look at the Kibana code where the snapshots list is retrieved for the UI, and it uses the parameter `ignore_unavailable: true`. Does this mean the response won't fail if some repos are missing?
Ah, that sounds promising, thanks @yuliacech. How does Kibana determine which repositories to request? Does it make a list of exact repository names or does it use `_all` or `*`?
For the UI we first get the list of existing repos with a Get repos request with `_all`, and then the user can select some repositories and for those we get the snapshots (see the filter button in the screenshot below). If the user doesn't select a specific repo, we use `_all` for the repos name on the Get snapshots request.
Thanks, that means there would be a very slight change in behaviour here: if a repository is removed from the cluster in between listing the repos and the user selecting the removed repo (plus at least one other repo), then with this PR we'd return a `RepositoryMissingException`, whereas previously they'd get a list of all the snapshots in the other repositories. IMO that's preferable, since it directs the user back to select a different collection of repositories, but we could also go back to something more like the older behaviour without losing most of the benefits of this PR. WDYT?
I think the UI should be able to handle the `RepositoryMissingException` by default, and this is probably very unlikely to happen, since the list of repos is re-fetched from ES on every page load. So I don't think we need to keep the old behaviour in this case.
So does it mean that the Get snapshots response won't have the `failures` property at all anymore, or are there still other cases where this property is used?
That's correct, you'll only see the `failures` field in a mixed-cluster situation, and we'll drop it entirely in v9.
Hi @DaveCTurner, I've updated the changelog YAML for you.