scylla-manager
sctool restore failed with error: "failed to open source object: object not found"
Issue description
- [ ] This issue is a regression.
- [x] It is unknown if this issue is a regression.
MgmtRestore nemesis failed with error:
< t:2024-01-13 00:13:30,049 f:remote_base.py l:521 c:RemoteLibSSH2CmdRunner p:DEBUG > Running command
"sudo sctool restore -c 526e048f-bb21-4f5c-a8d5-037023bf7467 --restore-schema --location s3:manager-backup-tests-permanent-snapshots-us-east-1 --snapshot-tag sm_20230702235739UTC"...
Jan 13 02:44:47 longevity-twcs-48h-2023-1-monitor-node-54645511-1 scylla-manager[13653]: {"L":"INFO","T":"2024-01-13T02:44:47.897Z","N":"scheduler.526e048f","M":
"Run ended with ERROR","task":"restore/5674514c-c882-4537-b66e-afc451552bde","status":"ERROR",
"cause":"not restored bundles [138]: restore batch: wait for job: job error (1705094012): failed to open source object: object not found","duration":"11m24.728567216s","_trace_id":"5slN6cw0Reaodl99ZUoP3A"}
Client version: 3.2.5-0.20231206.8b378dea
Server version: 3.2.5-0.20231206.8b378dea
Impact
sctool restore failed.
No other impact observed.
How frequently does it reproduce?
Found this issue. Not sure if it is the same as / similar to the scylladb issue linked below.
Installation details
Kernel Version: 5.15.0-1051-aws
Scylla version (or git commit hash): 2023.1.4-20240112.12c616e7f0cf
with build-id e7263a4aa92cf866b98cf680bd68d7198c9690c0
Cluster size: 4 nodes (i3en.2xlarge)
Scylla Nodes used in this run:
- longevity-twcs-48h-2023-1-db-node-54645511-6 (34.207.151.200 | 10.12.10.33) (shards: -1)
- longevity-twcs-48h-2023-1-db-node-54645511-5 (54.227.90.172 | 10.12.11.162) (shards: -1)
- longevity-twcs-48h-2023-1-db-node-54645511-4 (54.226.225.25 | 10.12.8.132) (shards: 7)
- longevity-twcs-48h-2023-1-db-node-54645511-3 (3.85.108.8 | 10.12.11.110) (shards: 7)
- longevity-twcs-48h-2023-1-db-node-54645511-2 (34.229.155.70 | 10.12.9.204) (shards: 7)
- longevity-twcs-48h-2023-1-db-node-54645511-1 (34.234.63.67 | 10.12.10.112) (shards: 7)
OS / Image: ami-08b5f8ff1565ab9f0
(aws: undefined_region)
Test: longevity-twcs-48h-test
Test id: 54645511-775e-4d02-8fd8-35a38a4a2df8
Test name: enterprise-2023.1/longevity/longevity-twcs-48h-test
Test config file(s):
Logs and commands
- Restore Monitor Stack command:
$ hydra investigate show-monitor 54645511-775e-4d02-8fd8-35a38a4a2df8
- Restore monitor on AWS instance using Jenkins job
- Show all stored logs command:
$ hydra investigate show-logs 54645511-775e-4d02-8fd8-35a38a4a2df8
Logs:
- db-cluster-54645511.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/54645511-775e-4d02-8fd8-35a38a4a2df8/20240113_101418/db-cluster-54645511.tar.gz
- sct-runner-events-54645511.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/54645511-775e-4d02-8fd8-35a38a4a2df8/20240113_101418/sct-runner-events-54645511.tar.gz
- sct-54645511.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/54645511-775e-4d02-8fd8-35a38a4a2df8/20240113_101418/sct-54645511.log.tar.gz
- loader-set-54645511.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/54645511-775e-4d02-8fd8-35a38a4a2df8/20240113_101418/loader-set-54645511.tar.gz
- monitor-set-54645511.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/54645511-775e-4d02-8fd8-35a38a4a2df8/20240113_101418/monitor-set-54645511.tar.gz
Found https://github.com/scylladb/scylladb/issues/16321. Not sure if it is the same / similar.
I don't think that's a similar issue. The mentioned issue was about restoring the schema multiple times on the same cluster, which is not supported. I haven't seen that happen in this issue.
It looks like the file me-138-big-Index.db is present in the SM manifest, but it is missing from the backup location, and that causes the restore to fail.
From SM logs it looks like the test scenario goes like this:
- make backup
- restore schema
- restore tables
But the strange thing is that the backup generates snapshot tag sm_20240112221504UTC, while both restores use snapshot tag sm_20230702235739UTC. Is this expected? Where does the snapshot tag used for the restore come from, and is there a chance that this backup is broken (missing s3:manager-backup-tests-permanent-snapshots-us-east-1/backup/sst/cluster/0f0f556f-eb17-4012-b39c-f99a35828c04/dc/us-east/node/15430605-a376-4758-9205-014ab34ad5d5/keyspace/100gb_sizetiered_2022_2/table/standard1/07206f60192311eea6af23bef1a3e064/me-138-big-Index.db)?
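For reference, assuming the sctool backup list / backup files subcommands work here as documented, the snapshot tags present at that location and the SSTables referenced by the restore tag could be inspected with something like the following (cluster ID taken from the restore command above; flags may need adjusting):
$ sctool backup list -c 526e048f-bb21-4f5c-a8d5-037023bf7467 --all-clusters \
    --location s3:manager-backup-tests-permanent-snapshots-us-east-1
$ sctool backup files -c 526e048f-bb21-4f5c-a8d5-037023bf7467 \
    --location s3:manager-backup-tests-permanent-snapshots-us-east-1 \
    --snapshot-tag sm_20230702235739UTC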
I validated that this file is indeed missing from the S3 dir, so it's either a problem with the test (using a predefined backup instead of a fresh one) or a problem with the predefined backup itself, which is not created as part of the test. @juliayakovlev can we close this issue?
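One way to reproduce that check is an aws CLI call like the one below (illustrative only; bucket and key copied from the object path quoted above, and head-object fails with a 404 error when the object does not exist):
$ aws s3api head-object \
    --bucket manager-backup-tests-permanent-snapshots-us-east-1 \
    --key backup/sst/cluster/0f0f556f-eb17-4012-b39c-f99a35828c04/dc/us-east/node/15430605-a376-4758-9205-014ab34ad5d5/keyspace/100gb_sizetiered_2022_2/table/standard1/07206f60192311eea6af23bef1a3e064/me-138-big-Index.db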
@ShlomiBalalis can you look into this, please?
@juliayakovlev , @ShlomiBalalis - any updates?
@ShlomiBalalis can you advise, please?
Hi! Sorry for the long silence. Yes, the file is missing, but I can't say for certain whether it has been missing ever since we created the backup or whether it went missing somewhere down the road. There is no lifecycle rule that would cause this file to be deleted, so if it was properly created in the first place, I don't know how it went missing. I'll try to find the logs of the original run to see if they are of any help.
The file was created over six months ago as part of another test run. Would that be a problem?
SM should have no problem with restoring old backups.
@ShlomiBalalis any news? The restore continues to fail.
@ShlomiBalalis ping
@mikliapko is this something that you could take care of? I mean validating whether this is a problem with an incomplete, cached backup or an actual issue.
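A rough sketch of that validation, assuming sctool backup files prints the remote s3:// path as the first column of each output line (the parsing would need adjusting to the actual output format):
$ sctool backup files -c 526e048f-bb21-4f5c-a8d5-037023bf7467 \
    --location s3:manager-backup-tests-permanent-snapshots-us-east-1 \
    --snapshot-tag sm_20230702235739UTC |
  while read -r path _; do
    # aws s3 ls exits non-zero when the object is not found
    aws s3 ls "$path" > /dev/null || echo "missing: $path"
  done
Anything reported as missing would point at the predefined backup itself being incomplete rather than at SM's restore logic.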