elasticsearch
elasticsearch copied to clipboard
[test-framework] Clean up a repository only if there's no snapshots in progress
Only delete the repository if there are no in progress snapshots to it. wipeSnapshots
gets called in an assertBusy
loop that checks that there's no in progress snapshot, so the repository will be eventually deleted.
Fixes #86890
Pinging @elastic/es-distributed (Team:Distributed)
I fail to understand how this fixes https://github.com/elastic/elasticsearch/issues/86890. The repository cannot be deleted since there's a searchable snapshot using it, and searchable snapshots should be deleted before in:
// Clean up searchable snapshots indices before deleting snapshots and repositories
if (hasXPack() && nodeVersions.first().onOrAfter(Version.V_7_8_0) && preserveSearchableSnapshotsIndicesUponCompletion() == false) {
wipeSearchableSnapshotsIndices();
}
Maybe there's some issue in wipeSearchableSnapshotsIndices()
? @arteam
Pinging @elastic/clients-team (Team:Clients)
@fcofdez You're totally right! I misinterpreted the error message: I thought snapshots couldn't get deleted due snapshot indices, not searchable snapshot indices. I will take a closer look at wipeSearchableSnapshotsIndices
@fcofdez
I've spent a lot of time on this task, and there are 2 separate issues. The test fails both because of searchable snapshots and regular snapshots. The fix in this PR does help, it resolves issues like this.
Jul 27 2022 at 10:47:04 PM FAILED 3.2 sec elasticsearch › :x-pack:plugin:ilm:qa:multi-node:javaRestTest elasticsearch-ci-immutable-windows-2012-r2-1658954370656474772
WINDOWSX64elastic+elasticsearch+7.17+multijob+platform-support-windowsGRADLE_TASK=checkPart2,os=windows-2012-r2CI7.17
org.elasticsearch.client.ResponseException: method [DELETE], host [http://[::1]:59969], URI [_snapshot/fpfajosvNoBnc], status line [HTTP/1.1 500 Internal Server Error]
{"error":{"root_cause":[{"type":"illegal_state_exception","reason":"trying to modify or unregister repository [fpfajosvNoBnc] that is currently used (snapshot is in progress)"}],"type":"illegal_state_exception","reason":"trying to modify or unregister repository [fpfajosvNoBnc] that is currently used (snapshot is in progress)"},"status":500}
I can reproduce this error somewhat regularly on my machine. The error for searchable snapshots isn't reproducible for my at all. I kind of suspect that it fails only on Windows.
@arteam I've been taking another look into this and I think we might be hitting a race condition here:
ESRestTestCase#wipeSearchableSnapshotsIndices
deletes the mounted searchable snapshots and then we try to remove all the snapshots in order to delete the repository.
The problem seems to be that the index is rolled over after we've cleaned all the searchable snapshots. restored-.ds-logs-dhemcylega-2022.05.18-000002
is the index mounted as searchable snapshot preventing the repository removal, notice that it is the 2
generation of that index. I think that's the issue in this particular test.
Thanks for the insight @fcofdez! That's a really good catch! I tried to fix the racing by adding an additional call to clear the searchable snapshot indices before removing the repository. WDYT?
Thanks for looking into this @arteam, I wonder if we should look into the failing test and ensure that we don't rollover or that the rolled over index is not mounted as a searchable snapshot? I think that we should tackle the issue there instead of trying to change the cleaning logic, wdyt?
@elasticmachine update branch
@elasticmachine update branch
The searchable snapshots bit has been addressed by https://github.com/elastic/elasticsearch/pull/90075, this PR is still relevant since we still have failures of the inability to remove repositories due to pending snapshots.