
Stricter failure handling in `TransportGetSnapshotsAction`

Open · DaveCTurner opened this issue 10 months ago · 15 comments

Today if there's a failure during a multi-repo get-snapshots request then we record a per-repository failure but allow the rest of the request to proceed. This is trappy for clients: they must always remember to check the failures response field or else risk missing some results. It's also a pain for the implementation, because we have to collect the per-repository results separately before adding them to the final result set, just in case the last one triggers a failure.

This commit drops this leniency and bubbles all failures straight up to the top level.
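For callers, the difference looks roughly like this. A minimal sketch, assuming a cluster at localhost:9200, the standard fetch API, and made-up repository names:

// Sketch only: hypothetical repository names, cluster assumed at localhost:9200.
async function getSnapshots(repos: string[]): Promise<unknown[]> {
  const resp = await fetch(
    `http://localhost:9200/_snapshot/${repos.join(",")}/_all?ignore_unavailable=false`
  );
  if (!resp.ok) {
    // With this change, any repository-level failure fails the whole request,
    // so a non-2xx status is the only failure signal the caller needs to handle.
    throw new Error(`get-snapshots failed with HTTP ${resp.status}`);
  }
  const body = await resp.json();
  // Today, a 200 response may still carry per-repository failures, so a careful
  // caller must also check this field or risk silently missing some results.
  if (body.failures && Object.keys(body.failures).length > 0) {
    throw new Error(`partial failure: ${JSON.stringify(body.failures)}`);
  }
  return body.snapshots;
}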

DaveCTurner avatar Apr 08 '24 06:04 DaveCTurner

Pinging @elastic/es-distributed (Team:Distributed)

elasticsearchmachine avatar Apr 08 '24 06:04 elasticsearchmachine

Hi @DaveCTurner, I've created a changelog YAML for you. Note that since this PR is labelled >breaking, you need to update the changelog YAML to fill out the extended information sections.

elasticsearchmachine avatar Apr 08 '24 06:04 elasticsearchmachine

I think this is a design bug, an artefact of the original multi-repo implementation in #42090 which we've preserved through subsequent changes. AFAICT there's no good reason for being so lenient here. Moreover, today's behaviour is an obstacle to improving the performance (https://github.com/elastic/elasticsearch/issues/95345) and memory usage (https://github.com/elastic/elasticsearch/issues/104607) of this API in clusters with high snapshot counts.

To be precise, this changes a few cases from partial failures into total failures:

  1. the user requests snapshots in some collection of repositories which includes a specific named repository that is not registered with the cluster. Previously we'd list all the other repositories and mark the specific named repository with a RepositoryMissingException; now the whole request fails with a RepositoryMissingException. This makes sense to me: the user should use the get-repositories API to determine which repositories are registered with the cluster rather than using this API.

  2. one of the repositories we're listing is so broken that we cannot read its RepositoryData. Previously we'd list all the other repositories and skip the broken one; now the whole request fails. This also makes sense to me: the user can exclude the broken repository from the request if desired.

  3. one of the target snapshots has an unreadable SnapshotInfo blob. Previously, if ?ignore_unavailable=false, we'd skip the whole repository and return incomplete results, whereas now we fail the whole request with a SnapshotMissingException. If ?ignore_unavailable=true we skip the missing snapshot as before, as sketched below.
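As a rough sketch of case 3 (the repository name and the calling code are illustrative, not taken from this change; cluster assumed at localhost:9200):

// Sketch for case 3: hypothetical repository name, standard fetch API.
async function checkIgnoreUnavailable(): Promise<void> {
  const base = "http://localhost:9200/_snapshot/repo1/_all";

  // With ?ignore_unavailable=true an unreadable SnapshotInfo blob is skipped, as before.
  const lenient = await fetch(`${base}?ignore_unavailable=true`);
  console.log(lenient.status); // 200, the missing snapshot is simply omitted

  // With ?ignore_unavailable=false the whole request now fails instead of
  // silently dropping the broken repository from the results.
  const strict = await fetch(`${base}?ignore_unavailable=false`);
  console.log(strict.status); // 404 (snapshot_missing_exception) after this change
}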

I'm marking this as >breaking since it's changing the behaviour of certain failure cases, even though I think we should go ahead with it. Also marking it as team-discuss so we remember to talk about it. If the team is ok with the idea then I'll formally propose it as a breaking change.

DaveCTurner avatar Apr 08 '24 07:04 DaveCTurner

We (the @elastic/es-distributed coordination subteam) discussed this today and broadly agreed with this direction, pending confirmation from Kibana and Cloud callers that this won't cause them any problems, and then agreement with the breaking changes committee that this change is acceptable.

DaveCTurner avatar Apr 09 '24 13:04 DaveCTurner

Hi @DaveCTurner, I've updated the changelog YAML for you. Note that since this PR is labelled >breaking, you need to update the changelog YAML to fill out the extended information sections.

elasticsearchmachine avatar Apr 09 '24 13:04 elasticsearchmachine

> Hi @DaveCTurner, I've updated the changelog YAML for you. Note that since this PR is labelled >breaking, you need to update the changelog YAML to fill out the extended information sections.

Bad bot 🤖

DaveCTurner avatar Apr 09 '24 14:04 DaveCTurner

Hi @DaveCTurner, I've updated the changelog YAML for you. Note that since this PR is labelled >breaking, you need to update the changelog YAML to fill out the extended information sections.

elasticsearchmachine avatar Apr 17 '24 17:04 elasticsearchmachine

To clarify, today a partial failure (with ?ignore_unavailable=false) when retrieving snapshots from multiple repositories looks like this:

200 OK
{
  "snapshots" : [
    {
      "snapshot" : "cfmqmfrheg",
      "uuid" : "Wb7b0_zVRH207wPtuMOy-w",
      "repository" : "repo0",
      "version_id" : 8505000,
      "version" : "8505000",
      "indices" : [
        "test-idx"
      ],
      "data_streams" : [ ],
      "include_global_state" : true,
      "state" : "SUCCESS",
      "start_time" : "2024-04-22T09:44:36.205Z",
      "start_time_in_millis" : 1713779076205,
      "end_time" : "2024-04-22T09:44:36.256Z",
      "end_time_in_millis" : 1713779076256,
      "duration" : "51ms",
      "duration_in_millis" : 51,
      "failures" : [ ],
      "shards" : {
        "total" : 4,
        "failed" : 0,
        "successful" : 4
      },
      "feature_states" : [ ]
    },
    {
      "snapshot" : "lhbqjriekp",
      "uuid" : "R5_AuiaHSEi4J0X2RvTBSw",
      "repository" : "repo0",
      "version_id" : 8505000,
      "version" : "8505000",
      "indices" : [
        "test-idx"
      ],
      "data_streams" : [ ],
      "include_global_state" : true,
      "state" : "SUCCESS",
      "start_time" : "2024-04-22T09:44:36.346Z",
      "start_time_in_millis" : 1713779076346,
      "end_time" : "2024-04-22T09:44:36.370Z",
      "end_time_in_millis" : 1713779076370,
      "duration" : "24ms",
      "duration_in_millis" : 24,
      "failures" : [ ],
      "shards" : {
        "total" : 4,
        "failed" : 0,
        "successful" : 4
      },
      "feature_states" : [ ]
    },
    {
      "snapshot" : "kqxbwkttzr",
      "uuid" : "0xa8mF3tQDaBe77-56jvqg",
      "repository" : "repo0",
      "version_id" : 8505000,
      "version" : "8505000",
      "indices" : [
        "test-idx"
      ],
      "data_streams" : [ ],
      "include_global_state" : true,
      "state" : "SUCCESS",
      "start_time" : "2024-04-22T09:44:36.462Z",
      "start_time_in_millis" : 1713779076462,
      "end_time" : "2024-04-22T09:44:36.495Z",
      "end_time_in_millis" : 1713779076495,
      "duration" : "33ms",
      "duration_in_millis" : 33,
      "failures" : [ ],
      "shards" : {
        "total" : 4,
        "failed" : 0,
        "successful" : 4
      },
      "feature_states" : [ ]
    },
    {
      "snapshot" : "jfljnrqutw",
      "uuid" : "G7zg1e6JQDKRTib6wO4erQ",
      "repository" : "repo1",
      "version_id" : 8505000,
      "version" : "8505000",
      "indices" : [
        "test-idx"
      ],
      "data_streams" : [ ],
      "include_global_state" : true,
      "state" : "SUCCESS",
      "start_time" : "2024-04-22T09:44:36.674Z",
      "start_time_in_millis" : 1713779076674,
      "end_time" : "2024-04-22T09:44:36.707Z",
      "end_time_in_millis" : 1713779076707,
      "duration" : "33ms",
      "duration_in_millis" : 33,
      "failures" : [ ],
      "shards" : {
        "total" : 4,
        "failed" : 0,
        "successful" : 4
      },
      "feature_states" : [ ]
    },
    {
      "snapshot" : "dnmxscesew",
      "uuid" : "_ZIcQBRaSMinWxa0pX8llA",
      "repository" : "repo1",
      "version_id" : 8505000,
      "version" : "8505000",
      "indices" : [
        "test-idx"
      ],
      "data_streams" : [ ],
      "include_global_state" : true,
      "state" : "SUCCESS",
      "start_time" : "2024-04-22T09:44:36.791Z",
      "start_time_in_millis" : 1713779076791,
      "end_time" : "2024-04-22T09:44:36.807Z",
      "end_time_in_millis" : 1713779076807,
      "duration" : "16ms",
      "duration_in_millis" : 16,
      "failures" : [ ],
      "shards" : {
        "total" : 4,
        "failed" : 0,
        "successful" : 4
      },
      "feature_states" : [ ]
    }
  ],
  "failures" : {
    "repo3" : {
      "type" : "snapshot_missing_exception",
      "reason" : "[repo3:jfljnrqutw] is missing"
    },
    "repo4" : {
      "type" : "snapshot_missing_exception",
      "reason" : "[repo4:lhbqjriekp] is missing"
    }
  },
  "total" : 5,
  "remaining" : 0
}

Note that we get no results from repo3 or repo4 in this case. With this change, those snapshot_missing_exception exceptions will be pulled up to the top level:

404 Not Found
{
  "error" : {
    "root_cause" : [
      {
        "type" : "snapshot_missing_exception",
        "reason" : "[repo3:jfljnrqutw] is missing"
      }
    ],
    "type" : "snapshot_missing_exception",
    "reason" : "[repo3:jfljnrqutw] is missing",
    "caused_by" : {
      "type" : "some_inner_exception",
      "reason" : "File [foo/bar/baz.dat] not found"
    }
  },
  "status" : 404
}

Note that this is already the behaviour if the request targets a single repository; it's only the multi-repo case that supports partial failures.

DaveCTurner avatar Apr 22 '24 09:04 DaveCTurner

Thanks a lot for the example, @DaveCTurner! I had a look at the Kibana code where the snapshots list is retrieved for the UI, and it uses the parameter ignore_unavailable: true. Does this mean the response won't fail if some repos are missing?

yuliacech avatar Apr 26 '24 10:04 yuliacech

Ah, that sounds promising, thanks @yuliacech. How does Kibana determine which repositories to request? Does it make a list of exact repository names, or does it use _all or *?

DaveCTurner avatar Apr 26 '24 10:04 DaveCTurner

For the UI we first get the list of existing repos with a Get repositories request using _all; then the user can select some repositories, and we get the snapshots for those (see the filter button in the screenshot below). If the user doesn't select a specific repo, we use _all as the repository name in the Get snapshots request.

[Screenshot: repository filter in the Snapshot and Restore UI, 2024-04-26]
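Roughly, the flow looks like this; a simplified sketch, not the actual Kibana code (the cluster URL and fetch usage are assumptions):

// Simplified sketch of the UI flow described above, not the actual Kibana implementation.
async function loadSnapshotsForUi(selectedRepos: string[]): Promise<unknown> {
  // Step 1: list all registered repositories so the user can pick from them.
  const reposResp = await fetch("http://localhost:9200/_snapshot/_all");
  const repoNames = Object.keys(await reposResp.json());
  console.log("registered repositories:", repoNames);

  // Step 2: fetch snapshots for the selected repos, or _all if nothing is selected.
  const target = selectedRepos.length > 0 ? selectedRepos.join(",") : "_all";
  const snapshotsResp = await fetch(
    `http://localhost:9200/_snapshot/${target}/_all?ignore_unavailable=true`
  );
  return snapshotsResp.json();
}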

yuliacech avatar Apr 26 '24 10:04 yuliacech

Thanks, that means there would be a very slight change in behaviour here: if a repository is removed from the cluster in between listing the repos and the user selecting the removed repo (plus at least one other repo), then with this PR we'd return a RepositoryMissingException, whereas previously they'd get a list of all the snapshots in the other repositories. IMO that's preferable, since it directs the user back to select a different collection of repositories, but we could also go back to something more like the older behaviour without losing most of the benefits of this PR. WDYT?

DaveCTurner avatar Apr 26 '24 10:04 DaveCTurner

I think the UI should be able to handle the RepositoryMissingException by default, and this is probably very unlikely to happen, since the list of repos is re-fetched from ES on every page load. So I don't think we need to keep the old behaviour in this case.

yuliacech avatar Apr 26 '24 12:04 yuliacech

So does it mean that the Get snapshots request won't have a failures property at all anymore, or are there still other cases where this property is used?

yuliacech avatar Apr 26 '24 12:04 yuliacech

That's correct: you'll only see the failures field in a mixed-cluster situation, and we'll drop it entirely in v9.

DaveCTurner avatar Apr 26 '24 13:04 DaveCTurner

Hi @DaveCTurner, I've updated the changelog YAML for you.

elasticsearchmachine avatar Jul 03 '24 08:07 elasticsearchmachine