Restoring a 3.1 backup with 3.2 fails, with files from the bucket appearing to be missing
Issue description
- [x] This issue is a regression.
- [ ] It is unknown if this issue is a regression.
At first, while using manager 3.1.2, we created a backup, which completed successfully:
< t:2023-08-16 19:54:15,367 f:base.py l:142 c:RemoteLibSSH2CmdRunner p:DEBUG > Command "sudo sctool backup -c 4105638a-e5a7-4c68-b8c2-614904a59131 --keyspace keyspace1 --location s3:manager-backup-tests-us-east-1 --cron '56 19 * * *' " finished with status 0
...
< t:2023-08-16 19:56:24,300 f:remote_base.py l:520 c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "sudo sctool -c 4105638a-e5a7-4c68-b8c2-614904a59131 progress backup/6d118b0f-2652-4258-a797-9a191b0b26f2"...
< t:2023-08-16 19:56:24,451 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > Run: ec1624fd-3c6e-11ee-bd30-02b041669e61
< t:2023-08-16 19:56:24,451 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > Status: DONE
< t:2023-08-16 19:56:24,451 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > Start time: 16 Aug 23 19:56:00 UTC
< t:2023-08-16 19:56:24,451 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > End time: 16 Aug 23 19:56:12 UTC
< t:2023-08-16 19:56:24,451 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > Duration: 12s
< t:2023-08-16 19:56:24,451 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > Progress: 100%
< t:2023-08-16 19:56:24,451 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > Snapshot Tag: sm_20230816195600UTC
< t:2023-08-16 19:56:24,451 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > Datacenters:
< t:2023-08-16 19:56:24,451 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > - us-east
< t:2023-08-16 19:56:24,451 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG >
< t:2023-08-16 19:56:24,451 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > ╭─────────────┬──────────┬──────────┬──────────┬──────────────┬────────╮
< t:2023-08-16 19:56:24,451 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > │ Host │ Progress │ Size │ Success │ Deduplicated │ Failed │
< t:2023-08-16 19:56:24,452 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > ├─────────────┼──────────┼──────────┼──────────┼──────────────┼────────┤
< t:2023-08-16 19:56:24,452 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > │ 10.12.0.108 │ 100% │ 283.379M │ 283.379M │ 0 │ 0 │
< t:2023-08-16 19:56:24,452 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > │ 10.12.1.151 │ 100% │ 283.379M │ 283.379M │ 0 │ 0 │
< t:2023-08-16 19:56:24,452 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > │ 10.12.1.249 │ 100% │ 283.377M │ 283.377M │ 0 │ 0 │
< t:2023-08-16 19:56:24,452 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > ╰─────────────┴──────────┴──────────┴──────────┴──────────────┴────────╯
Afterwards, we upgraded the manager to 3.2.0:
< t:2023-08-16 20:01:12,692 f:remote_base.py l:520 c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "sudo sctool version"...
< t:2023-08-16 20:01:12,770 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > Client version: 3.2.0-0.20230815.9db309c5
< t:2023-08-16 20:01:12,772 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > Server version: 3.2.0-0.20230815.9db309c5
And attempted to restore the aforementioned backup:
< t:2023-08-16 20:05:57,707 f:base.py l:142 c:RemoteLibSSH2CmdRunner p:DEBUG > Command "sudo sctool restore -c 4105638a-e5a7-4c68-b8c2-614904a59131 --restore-tables --location s3:manager-backup-tests-us-east-1 --snapshot-tag sm_20230816195600UTC" finished with status 0
< t:2023-08-16 20:05:57,708 f:cli.py l:1126 c:sdcm.mgmt.cli p:DEBUG > sctool output: restore/03db246a-0533-4a91-9079-9ecdf41884f2
But the restore failed, reporting that files are missing from the bucket:
< t:2023-08-16 20:06:35,842 f:base.py l:142 c:RemoteLibSSH2CmdRunner p:DEBUG > Command "sudo sctool -c 4105638a-e5a7-4c68-b8c2-614904a59131 progress restore/03db246a-0533-4a91-9079-9ecdf41884f2" finished with status 0
< t:2023-08-16 20:06:35,843 f:cli.py l:1126 c:sdcm.mgmt.cli p:DEBUG > sctool output: Restore progress
< t:2023-08-16 20:06:35,843 f:cli.py l:1126 c:sdcm.mgmt.cli p:DEBUG > Run: 500d5a78-3c70-11ee-8668-02b041669e61
< t:2023-08-16 20:06:35,843 f:cli.py l:1126 c:sdcm.mgmt.cli p:DEBUG > Status: ERROR (restoring backed-up data)
< t:2023-08-16 20:06:35,843 f:cli.py l:1126 c:sdcm.mgmt.cli p:DEBUG > Cause: not restored bundles [3 2]: wait on rclone job, id: 1692216016, host
< t:2023-08-16 20:06:35,843 f:cli.py l:1126 c:sdcm.mgmt.cli p:DEBUG > 10.12.0.108: job error (1692216016): failed to open source object: object not found; backup/sst/cluster/0fa19c7a-d21a-4bff-9f19-80c33dd2ae15/dc/us-east/node/1719707d-6da3-4cc6-b2b8-a8cef613b028/keyspace/keyspace2/table/standard1/c34e94003c6e11eeaf7f8fd4eda90fdb/me-2-big-Data.db failed to open source object: object not found; backup/sst/cluster/0fa19c7a-d21a-4bff-9f19-80c33dd2ae15/dc/us-east/node/1719707d-6da3-4cc6-b2b8-a8cef613b028/keyspace/keyspace2/table/standard1/c34e94003c6e11eeaf7f8fd4eda90fdb/me-2-big-Statistics.db failed to open source object: object not found; backup/sst/cluster/0fa19c7a-d21a-4bff-9f19-80c33dd2ae15/dc/us-east/node/1719707d-6da3-4cc6-b2b8-a8cef613b028/keyspace/keyspace2/table/standard1/c34e94003c6e11eeaf7f8fd4eda90fdb/me-2-big-Summary.db failed to open source object: object not found; backup/sst/cluster/0fa19c7a-d21a-4bff-9f19-80c33dd2ae15/dc/us-east/node/1719707d-6da3-4cc6-b2b8-a8cef613b028/keyspace/keyspace2/table/standard1/c34e94003c6e11eeaf7f8fd4eda90fdb/me-2-big-TOC.txt failed to open source object: object not found; backup/sst/cluster/0fa19c7a-d21a-4bff-9f19-80c33dd2ae15/dc/us-east/node/1719707d-6da3-4cc6-b2b8-a8cef613b028/keyspace/keyspace2/table/standard1/c34e94003c6e11eeaf7f8fd4eda90fdb/me-3-big-CRC.db failed to open source object: object not found; backup/sst/cluster/0fa19c7a-d21a-4bff-9f19-80c33dd2ae15/dc/us-east/node/1719707d-6da3-4cc6-b2b8-a8cef613b028/keyspace/keyspace2/table/standard1/c34e94003c6e11eeaf7f8fd4eda90fdb/me-3-big-Data.db failed to open source object: object not found; backup/sst/cluster/0fa19c7a-d21a-4bff-9f19-80c33dd2ae15/dc/us-east/node/1719707d-6da3-4cc6-b2b8-a8cef613b028/keyspace/keyspace2/table/standard1/c34e94003c6e11eeaf7f8fd4eda90fdb/me-3-big-Digest.crc32 failed to open source object: object not found; backup/sst/cluster/0fa19c7a-d21a-4bff-9f19-80c33dd2ae15/dc/us-east/node/1719707d-6da3-4cc6-b2b8-a8cef613b028/keyspace/keyspace2/table/standard1/c34e94003c6e11eeaf7f8fd4eda90fdb/me-3-big-Filter.db failed to open source object: object not found; backup/sst/cluster/0fa19c7a-d21a-4bff-9f19-80c33dd2ae15/dc/us-east/node/1719707d-6da3-4cc6-b2b8-a8cef613b028/keyspace/keyspace2/table/standard1/c34e94003c6e11eeaf7f8fd4eda90fdb/me-3-big-Index.db failed to open source object: object not found; backup/sst/cluster/0fa19c7a-d21a-4bff-9f19-80c33dd2ae15/dc/us-east/node/1719707d-6da3-4cc6-b2b8-a8cef613b028/keyspace/keyspace2/table/standard1/c34e94003c6e11eeaf7f8fd4eda90fdb/me-3-big-Scylla.db failed to open source object: object not found; backup/sst/cluster/0fa19c7a-d21a-4bff-9f19-80c33dd2ae15/dc/us-east/node/1719707d-6da3-4cc6-b2b8-a8cef613b028/keyspace/keyspace2/table/standard1/c34e94003c6e11eeaf7f8fd4eda90fdb/me-3-big-Statistics.db failed to open source object: object not found; backup/sst/cluster/0fa19c7a-d21a-4bff-9f19-80c33dd2ae15/dc/us-east/node/1719707d-6da3-4cc6-b2b8-a8cef613b028/keyspace/keyspace2/table/standard1/c34e94003c6e11eeaf7f8fd4eda90fdb/me-3-big-Summary.db failed to open source object: object not found; backup/sst/cluster/0fa19c7a-d21a-4bff-9f19-80c33dd2ae15/dc/us-east/node/1719707d-6da3-4cc6-b2b8-a8cef613b028/keyspace/keyspace2/table/standard1/c34e94003c6e11eeaf7f8fd4eda90fdb/me-3-big-TOC.txt failed to open source object: object not found
< t:2023-08-16 20:06:35,843 f:cli.py l:1126 c:sdcm.mgmt.cli p:DEBUG > Start time: 16 Aug 23 20:05:57 UTC
< t:2023-08-16 20:06:35,843 f:cli.py l:1126 c:sdcm.mgmt.cli p:DEBUG > End time: 16 Aug 23 20:06:05 UTC
< t:2023-08-16 20:06:35,843 f:cli.py l:1126 c:sdcm.mgmt.cli p:DEBUG > Duration: 8s
< t:2023-08-16 20:06:35,843 f:cli.py l:1126 c:sdcm.mgmt.cli p:DEBUG > Progress: 0% | 0%
< t:2023-08-16 20:06:35,843 f:cli.py l:1126 c:sdcm.mgmt.cli p:DEBUG > Snapshot Tag: sm_20230816195600UTC
< t:2023-08-16 20:06:35,843 f:cli.py l:1126 c:sdcm.mgmt.cli p:DEBUG >
< t:2023-08-16 20:06:35,843 f:cli.py l:1126 c:sdcm.mgmt.cli p:DEBUG > ╭───────────┬──────────┬──────────┬─────────┬────────────┬────────╮
< t:2023-08-16 20:06:35,843 f:cli.py l:1126 c:sdcm.mgmt.cli p:DEBUG > │ Keyspace │ Progress │ Size │ Success │ Downloaded │ Failed │
< t:2023-08-16 20:06:35,843 f:cli.py l:1126 c:sdcm.mgmt.cli p:DEBUG > ├───────────┼──────────┼──────────┼─────────┼────────────┼────────┤
< t:2023-08-16 20:06:35,843 f:cli.py l:1126 c:sdcm.mgmt.cli p:DEBUG > │ keyspace2 │ 0% | 0% │ 849.762M │ 0 │ 0 │ 0 │
< t:2023-08-16 20:06:35,843 f:cli.py l:1126 c:sdcm.mgmt.cli p:DEBUG > │ keyspace1 │ 0% | 0% │ 849.861M │ 0 │ 0 │ 0 │
< t:2023-08-16 20:06:35,843 f:cli.py l:1126 c:sdcm.mgmt.cli p:DEBUG > ╰───────────┴──────────┴──────────┴─────────┴────────────┴────────╯
To summarize, the files that the restore task reports as missing are:
backup/sst/cluster/0fa19c7a-d21a-4bff-9f19-80c33dd2ae15/dc/us-east/node/1719707d-6da3-4cc6-b2b8-a8cef613b028/keyspace/keyspace2/table/standard1/c34e94003c6e11eeaf7f8fd4eda90fdb/me-2-big-Data.db
backup/sst/cluster/0fa19c7a-d21a-4bff-9f19-80c33dd2ae15/dc/us-east/node/1719707d-6da3-4cc6-b2b8-a8cef613b028/keyspace/keyspace2/table/standard1/c34e94003c6e11eeaf7f8fd4eda90fdb/me-2-big-Statistics.db
backup/sst/cluster/0fa19c7a-d21a-4bff-9f19-80c33dd2ae15/dc/us-east/node/1719707d-6da3-4cc6-b2b8-a8cef613b028/keyspace/keyspace2/table/standard1/c34e94003c6e11eeaf7f8fd4eda90fdb/me-2-big-Summary.db
backup/sst/cluster/0fa19c7a-d21a-4bff-9f19-80c33dd2ae15/dc/us-east/node/1719707d-6da3-4cc6-b2b8-a8cef613b028/keyspace/keyspace2/table/standard1/c34e94003c6e11eeaf7f8fd4eda90fdb/me-2-big-TOC.txt
backup/sst/cluster/0fa19c7a-d21a-4bff-9f19-80c33dd2ae15/dc/us-east/node/1719707d-6da3-4cc6-b2b8-a8cef613b028/keyspace/keyspace2/table/standard1/c34e94003c6e11eeaf7f8fd4eda90fdb/me-3-big-CRC.db
backup/sst/cluster/0fa19c7a-d21a-4bff-9f19-80c33dd2ae15/dc/us-east/node/1719707d-6da3-4cc6-b2b8-a8cef613b028/keyspace/keyspace2/table/standard1/c34e94003c6e11eeaf7f8fd4eda90fdb/me-3-big-Data.db
backup/sst/cluster/0fa19c7a-d21a-4bff-9f19-80c33dd2ae15/dc/us-east/node/1719707d-6da3-4cc6-b2b8-a8cef613b028/keyspace/keyspace2/table/standard1/c34e94003c6e11eeaf7f8fd4eda90fdb/me-3-big-Digest.crc32
backup/sst/cluster/0fa19c7a-d21a-4bff-9f19-80c33dd2ae15/dc/us-east/node/1719707d-6da3-4cc6-b2b8-a8cef613b028/keyspace/keyspace2/table/standard1/c34e94003c6e11eeaf7f8fd4eda90fdb/me-3-big-Filter.db
backup/sst/cluster/0fa19c7a-d21a-4bff-9f19-80c33dd2ae15/dc/us-east/node/1719707d-6da3-4cc6-b2b8-a8cef613b028/keyspace/keyspace2/table/standard1/c34e94003c6e11eeaf7f8fd4eda90fdb/me-3-big-Index.db
backup/sst/cluster/0fa19c7a-d21a-4bff-9f19-80c33dd2ae15/dc/us-east/node/1719707d-6da3-4cc6-b2b8-a8cef613b028/keyspace/keyspace2/table/standard1/c34e94003c6e11eeaf7f8fd4eda90fdb/me-3-big-Scylla.db
backup/sst/cluster/0fa19c7a-d21a-4bff-9f19-80c33dd2ae15/dc/us-east/node/1719707d-6da3-4cc6-b2b8-a8cef613b028/keyspace/keyspace2/table/standard1/c34e94003c6e11eeaf7f8fd4eda90fdb/me-3-big-Statistics.db
backup/sst/cluster/0fa19c7a-d21a-4bff-9f19-80c33dd2ae15/dc/us-east/node/1719707d-6da3-4cc6-b2b8-a8cef613b028/keyspace/keyspace2/table/standard1/c34e94003c6e11eeaf7f8fd4eda90fdb/me-3-big-Summary.db
backup/sst/cluster/0fa19c7a-d21a-4bff-9f19-80c33dd2ae15/dc/us-east/node/1719707d-6da3-4cc6-b2b8-a8cef613b028/keyspace/keyspace2/table/standard1/c34e94003c6e11eeaf7f8fd4eda90fdb/me-3-big-TOC.txt
These files do indeed appear to be missing from the bucket:
https://s3.console.aws.amazon.com/s3/buckets/manager-backup-tests-us-east-1?region=us-east-1&prefix=backup/sst/cluster/0fa19c7a-d21a-4bff-9f19-80c33dd2ae15/dc/us-east/node/1719707d-6da3-4cc6-b2b8-a8cef613b028/keyspace/keyspace2/table/standard1/c34e94003c6e11eeaf7f8fd4eda90fdb/&showversions=false
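For reference, the same prefix can also be checked from the command line. A minimal sketch, assuming the AWS CLI is configured with read access to the test bucket (bucket name and prefix copied from the restore error above):
# assumes AWS CLI read access to manager-backup-tests-us-east-1; prefix taken from the restore error
$ aws s3 ls s3://manager-backup-tests-us-east-1/backup/sst/cluster/0fa19c7a-d21a-4bff-9f19-80c33dd2ae15/dc/us-east/node/1719707d-6da3-4cc6-b2b8-a8cef613b028/keyspace/keyspace2/table/standard1/c34e94003c6e11eeaf7f8fd4eda90fdb/
A listing that lacks the me-2-* and me-3-* components above would confirm the objects are really absent rather than merely unreadable.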
Impact
Restoring older backups would not be possible.
Installation details
Kernel Version: 5.15.0-1034-aws
Scylla version (or git commit hash): 5.0.13-20230423.a0ca8abe4
with build-id 1196f46d1be44927c95ac67703df3ab2ef98a332
Initial manager version: 3.1.2-0.20230704.bd349aa4 Manager version after upgrade: 3.2.0-0.20230815.9db309c5
Cluster size: 3 nodes (i4i.large)
Scylla Nodes used in this run:
- manager-upgrade-manager--db-node-c60a1216-3 (44.197.244.212 | 10.12.1.151) (shards: 2)
- manager-upgrade-manager--db-node-c60a1216-2 (44.203.9.9 | 10.12.0.108) (shards: 2)
- manager-upgrade-manager--db-node-c60a1216-1 (44.195.38.80 | 10.12.1.249) (shards: 2)
OS / Image: ami-0659419f07e694eaf
(aws: undefined_region)
Test: ubuntu20-upgrade-test
Test id: c60a1216-9e68-4134-a42a-a34ce6a71a45
Test name: manager-3.2/ubuntu20-upgrade-test
Test config file(s):
Logs and commands
- Restore Monitor Stack command:
$ hydra investigate show-monitor c60a1216-9e68-4134-a42a-a34ce6a71a45
- Restore monitor on AWS instance using Jenkins job
- Show all stored logs command:
$ hydra investigate show-logs c60a1216-9e68-4134-a42a-a34ce6a71a45
Logs:
- db-cluster-c60a1216.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/c60a1216-9e68-4134-a42a-a34ce6a71a45/20230816_200800/db-cluster-c60a1216.tar.gz
- sct-runner-events-c60a1216.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/c60a1216-9e68-4134-a42a-a34ce6a71a45/20230816_200800/sct-runner-events-c60a1216.tar.gz
- sct-c60a1216.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/c60a1216-9e68-4134-a42a-a34ce6a71a45/20230816_200800/sct-c60a1216.log.tar.gz
- loader-set-c60a1216.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/c60a1216-9e68-4134-a42a-a34ce6a71a45/20230816_200800/loader-set-c60a1216.tar.gz
- monitor-set-c60a1216.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/c60a1216-9e68-4134-a42a-a34ce6a71a45/20230816_200800/monitor-set-c60a1216.tar.gz
- Scylla 5.0.13 is not supported anymore.
- So if the files are not in the bucket, where are they? Who deleted them, or were they never backed up properly in the first place?
< t:2023-08-16 19:54:15,367 f:base.py l:142 c:RemoteLibSSH2CmdRunner p:DEBUG > Command "sudo sctool backup -c 4105638a-e5a7-4c68-b8c2-614904a59131 --keyspace keyspace1 --location s3:manager-backup-tests-us-east-1 --cron '56 19 * * *' " finished with status 0
Well, only keyspace1 is being backed up, so it is not surprising that files from keyspace2 are missing.
I'm investigating why SM would try to restore keyspace2 if it wasn't part of the backup.
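One way to verify which keyspaces a snapshot actually contains is to list its files with sctool. A minimal sketch, assuming the backup files subcommand with --location and --snapshot-tag flags behaves as documented for this sctool version:
# assumption: sctool backup files accepts --location and --snapshot-tag in this release
$ sudo sctool backup files -c 4105638a-e5a7-4c68-b8c2-614904a59131 --location s3:manager-backup-tests-us-east-1 --snapshot-tag sm_20230816195600UTC
If only keyspace1 paths are printed, that would confirm keyspace2 was never part of this backup.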
What is interesting here is that the backup bucket contains two different clusters, 4105638a-e5a7-4c68-b8c2-614904a59131 and 0fa19c7a-d21a-4bff-9f19-80c33dd2ae15, and both of them have a backup with the same snapshot tag sm_20230816195600UTC. This is possible only if ~~there are 2 instances of SM running backup~~ two different clusters are being backed up into the same remote location at the very same time (up to a 1 second backup start time difference).
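For context, the snapshot tag appears to encode the backup start time with one-second resolution (sm_20230816195600UTC matches the 19:56:00 UTC start time above). A sketch of that assumption using GNU date:
# assumption: the tag is simply the UTC start time formatted to the second
$ date -u -d '2023-08-16 19:56:00 UTC' '+sm_%Y%m%d%H%M%SUTC'
sm_20230816195600UTC
Two backups of two different clusters started within the same second would therefore get identical tags in the same bucket.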
In this test scenario, the provided SM logs show that SM backed up cluster 4105638a-e5a7-4c68-b8c2-614904a59131 with snapshot tag sm_20230816195600UTC, but the following restore tried to restore the backup of cluster 0fa19c7a-d21a-4bff-9f19-80c33dd2ae15 with the same snapshot tag.
Even so, I would expect the restore to finish successfully, restoring the data from this second cluster, so it's still puzzling why files from that second cluster are missing.
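Listing the backups visible in the location should make the overlap explicit. A hedged example, assuming sctool backup list supports --all-clusters and --location in this version:
# assumption: --all-clusters lists backups of every cluster found under the given location
$ sudo sctool backup list -c 4105638a-e5a7-4c68-b8c2-614904a59131 --all-clusters --location s3:manager-backup-tests-us-east-1
Both cluster prefixes should then show up with the same sm_20230816195600UTC snapshot tag.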
@ShlomiBalalis could you verify whether what I'm saying makes sense?
Moreover, please describe what tests were conducted on this second cluster (0fa19c7a-d21a-4bff-9f19-80c33dd2ae15), because perhaps those test scenarios are actually responsible for deleting these files.
@ShlomiBalalis ping
@ShlomiBalalis
@ShlomiBalalis could you please reply to the comment https://github.com/scylladb/scylla-manager/issues/3548#issuecomment-1699124738 ? We need to understand if there is really a bug in Manager and whether we should include it in 3.2.2.
I have no idea where the second cluster came from. We only executed one cluster add in this test, which gave us 4105638a-e5a7-4c68-b8c2-614904a59131:
< t:2023-08-16 19:50:38,371 f:base.py l:142 c:RemoteLibSSH2CmdRunner p:DEBUG > Command "sudo sctool cluster add --host=10.12.1.249 --name=cluster_under_test --auth-token c60a1216-9e68-4134-a42a-a34ce6a71a45" finished with status 0
< t:2023-08-16 19:50:38,371 f:cli.py l:1126 c:sdcm.mgmt.cli p:DEBUG > sctool output: 4105638a-e5a7-4c68-b8c2-614904a59131
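To double-check that only one cluster was ever registered with this Manager instance, the registered clusters can also be listed (sctool cluster list is a standard subcommand; the exact output format may differ by version):
$ sudo sctool cluster list
If it shows only 4105638a-e5a7-4c68-b8c2-614904a59131, then the second UUID was never registered with this Manager and must have ended up in the bucket some other way.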
I see that the second cluster is only mentioned with regard to the restore/03db246a-0533-4a91-9079-9ecdf41884f2 task, but that task was created using the 4105638a-e5a7-4c68-b8c2-614904a59131 cluster:
< t:2023-08-16 20:05:57,707 f:base.py l:142 c:RemoteLibSSH2CmdRunner p:DEBUG > Command "sudo sctool restore -c 4105638a-e5a7-4c68-b8c2-614904a59131 --restore-tables --location s3:manager-backup-tests-us-east-1 --snapshot-tag sm_20230816195600UTC" finished with status 0
< t:2023-08-16 20:05:57,708 f:cli.py l:1126 c:sdcm.mgmt.cli p:DEBUG > sctool output: restore/03db246a-0533-4a91-9079-9ecdf41884f2
The snapshot tag that the restore task uses, sm_20230816195600UTC, was created by the backup/6d118b0f-2652-4258-a797-9a191b0b26f2 task, which was also created using 4105638a-e5a7-4c68-b8c2-614904a59131:
< t:2023-08-16 19:54:15,367 f:base.py l:142 c:RemoteLibSSH2CmdRunner p:DEBUG > Command "sudo sctool backup -c 4105638a-e5a7-4c68-b8c2-614904a59131 --keyspace keyspace1 --location s3:manager-backup-tests-us-east-1 --cron '56 19 * * *' " finished with status 0
< t:2023-08-16 19:54:15,367 f:cli.py l:1126 c:sdcm.mgmt.cli p:DEBUG > sctool output: backup/6d118b0f-2652-4258-a797-9a191b0b26f2
...
< t:2023-08-16 19:56:24,300 f:remote_base.py l:520 c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "sudo sctool -c 4105638a-e5a7-4c68-b8c2-614904a59131 progress backup/6d118b0f-2652-4258-a797-9a191b0b26f2"...
< t:2023-08-16 19:56:24,451 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > Run: ec1624fd-3c6e-11ee-bd30-02b041669e61
< t:2023-08-16 19:56:24,451 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > Status: DONE
< t:2023-08-16 19:56:24,451 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > Start time: 16 Aug 23 19:56:00 UTC
< t:2023-08-16 19:56:24,451 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > End time: 16 Aug 23 19:56:12 UTC
< t:2023-08-16 19:56:24,451 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > Duration: 12s
< t:2023-08-16 19:56:24,451 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > Progress: 100%
< t:2023-08-16 19:56:24,451 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > Snapshot Tag: sm_20230816195600UTC
< t:2023-08-16 19:56:24,451 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > Datacenters:
< t:2023-08-16 19:56:24,451 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > - us-east
< t:2023-08-16 19:56:24,451 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG >
< t:2023-08-16 19:56:24,451 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > ╭─────────────┬──────────┬──────────┬──────────┬──────────────┬────────╮
< t:2023-08-16 19:56:24,451 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > │ Host │ Progress │ Size │ Success │ Deduplicated │ Failed │
< t:2023-08-16 19:56:24,452 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > ├─────────────┼──────────┼──────────┼──────────┼──────────────┼────────┤
< t:2023-08-16 19:56:24,452 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > │ 10.12.0.108 │ 100% │ 283.379M │ 283.379M │ 0 │ 0 │
< t:2023-08-16 19:56:24,452 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > │ 10.12.1.151 │ 100% │ 283.379M │ 283.379M │ 0 │ 0 │
< t:2023-08-16 19:56:24,452 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > │ 10.12.1.249 │ 100% │ 283.377M │ 283.377M │ 0 │ 0 │
< t:2023-08-16 19:56:24,452 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > ╰─────────────┴──────────┴──────────┴──────────┴──────────────┴────────╯
So I really don't know how 0fa19c7a-d21a-4bff-9f19-80c33dd2ae15 came into play.
So from my perspective, this looks like there was concurrent test execution or something similar. @ShlomiBalalis was this issue ever reproduced? If not, could we rerun it and see if it still happens?
@ShlomiBalalis ping
@mikliapko is this something that you could take care of? I mean investigating the reasons for two clusters being there.