
sctool restore failed with error: "failed to open source object: object not found"

juliayakovlev opened this issue 1 year ago • 10 comments

Issue description

  • [ ] This issue is a regression.
  • [x] It is unknown if this issue is a regression.

MgmtRestore nemesis failed with error:

< t:2024-01-13 00:13:30,049 f:remote_base.py  l:521  c:RemoteLibSSH2CmdRunner p:DEBUG > Running command 
"sudo sctool restore -c 526e048f-bb21-4f5c-a8d5-037023bf7467 --restore-schema --location s3:manager-backup-tests-permanent-snapshots-us-east-1  --snapshot-tag sm_20230702235739UTC"...

Jan 13 02:44:47 longevity-twcs-48h-2023-1-monitor-node-54645511-1 scylla-manager[13653]: {"L":"INFO","T":"2024-01-13T02:44:47.897Z","N":"scheduler.526e048f","M":
"Run ended with ERROR","task":"restore/5674514c-c882-4537-b66e-afc451552bde","status":"ERROR",
"cause":"not restored bundles [138]: restore batch: wait for job: job error (1705094012): failed to open source object: object not found","duration":"11m24.728567216s","_trace_id":"5slN6cw0Reaodl99ZUoP3A"}

Client version: 3.2.5-0.20231206.8b378dea
Server version: 3.2.5-0.20231206.8b378dea

Impact

sctool restore failed. No other impact observed.

How frequently does it reproduce?

Found this issue. Not sure if it is the same / similar.

Installation details

Kernel Version: 5.15.0-1051-aws
Scylla version (or git commit hash): 2023.1.4-20240112.12c616e7f0cf with build-id e7263a4aa92cf866b98cf680bd68d7198c9690c0

Cluster size: 4 nodes (i3en.2xlarge)

Scylla Nodes used in this run:

  • longevity-twcs-48h-2023-1-db-node-54645511-6 (34.207.151.200 | 10.12.10.33) (shards: -1)
  • longevity-twcs-48h-2023-1-db-node-54645511-5 (54.227.90.172 | 10.12.11.162) (shards: -1)
  • longevity-twcs-48h-2023-1-db-node-54645511-4 (54.226.225.25 | 10.12.8.132) (shards: 7)
  • longevity-twcs-48h-2023-1-db-node-54645511-3 (3.85.108.8 | 10.12.11.110) (shards: 7)
  • longevity-twcs-48h-2023-1-db-node-54645511-2 (34.229.155.70 | 10.12.9.204) (shards: 7)
  • longevity-twcs-48h-2023-1-db-node-54645511-1 (34.234.63.67 | 10.12.10.112) (shards: 7)

OS / Image: ami-08b5f8ff1565ab9f0 (aws: undefined_region)

Test: longevity-twcs-48h-test
Test id: 54645511-775e-4d02-8fd8-35a38a4a2df8
Test name: enterprise-2023.1/longevity/longevity-twcs-48h-test
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 54645511-775e-4d02-8fd8-35a38a4a2df8
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 54645511-775e-4d02-8fd8-35a38a4a2df8

Logs:

  • Jenkins job URL
  • Argus

juliayakovlev avatar Jan 17 '24 11:01 juliayakovlev

Found https://github.com/scylladb/scylladb/issues/16321. Not sure if it is the same / similar.

I don't think that's a similar issue. The mentioned issue was about restoring the schema multiple times on the same cluster, which is not supported. I haven't seen that happen in this issue.

It looks like the file me-138-big-Index.db is present in the SM manifest but missing from the backup location, and that causes the restore to fail.

From SM logs it looks like the test scenario goes like this:

  • make backup
  • restore schema
  • restore tables

But the strange thing is that the backup generates snapshot tag sm_20240112221504UTC, while both restores use snapshot tag sm_20230702235739UTC. Is this expected? Where does the snapshot tag used for the restore come from, and is there a chance that this backup is broken (missing s3:manager-backup-tests-permanent-snapshots-us-east-1/backup/sst/cluster/0f0f556f-eb17-4012-b39c-f99a35828c04/dc/us-east/node/15430605-a376-4758-9205-014ab34ad5d5/keyspace/100gb_sizetiered_2022_2/table/standard1/07206f60192311eea6af23bef1a3e064/me-138-big-Index.db)?

Michal-Leszczynski avatar Jan 19 '24 14:01 Michal-Leszczynski
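
A hedged way to check which snapshot tags actually exist in that backup location (cluster ID and bucket name are taken from the command in the issue description; output format may differ between SM versions):

$ sctool backup list -c 526e048f-bb21-4f5c-a8d5-037023bf7467 \
    --all-clusters \
    --location s3:manager-backup-tests-permanent-snapshots-us-east-1

If sm_20230702235739UTC does not appear in this listing, or belongs to a different cluster than the one the test just backed up, that would point at the test using a stale, predefined snapshot tag rather than the one produced by the backup step.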

I validated that this file is indeed missing from the S3 dir, so it's either a problem with the test (using a predefined backup instead of a fresh one) or a problem with the predefined backup itself, which is not part of the test. @juliayakovlev can we close this issue?

Michal-Leszczynski avatar Jan 24 '24 09:01 Michal-Leszczynski
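
For reference, a minimal sketch of that check, assuming AWS CLI access to the bucket (bucket and key are the path quoted in the earlier comment; head-object returns a 404 "Not Found" error when the object is missing):

$ aws s3api head-object \
    --bucket manager-backup-tests-permanent-snapshots-us-east-1 \
    --key backup/sst/cluster/0f0f556f-eb17-4012-b39c-f99a35828c04/dc/us-east/node/15430605-a376-4758-9205-014ab34ad5d5/keyspace/100gb_sizetiered_2022_2/table/standard1/07206f60192311eea6af23bef1a3e064/me-138-big-Index.db

# Alternatively, list the table directory and look for the me-138 bundle
$ aws s3 ls s3://manager-backup-tests-permanent-snapshots-us-east-1/backup/sst/cluster/0f0f556f-eb17-4012-b39c-f99a35828c04/dc/us-east/node/15430605-a376-4758-9205-014ab34ad5d5/keyspace/100gb_sizetiered_2022_2/table/standard1/07206f60192311eea6af23bef1a3e064/ | grep me-138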

@ShlomiBalalis can you look into this, please?

juliayakovlev avatar Jan 24 '24 10:01 juliayakovlev

@juliayakovlev , @ShlomiBalalis - any updates?

mykaul avatar Jan 30 '24 08:01 mykaul

@juliayakovlev , @ShlomiBalalis - any updates?

@ShlomiBalalis can you advise, please?

juliayakovlev avatar Jan 30 '24 10:01 juliayakovlev

Hi! Sorry for the long silence. Yes, the file is missing, but I can't say for certain whether it has been missing ever since we created the backup or went missing somewhere down the road. There is no lifecycle rule that would cause this file to be deleted, so if it was properly created in the first place, I don't know how it went missing. I'll try to find the logs of the original run to see if they are of any help.

I validated that this file is indeed missing from the S3 dir, so it's either a problem with the test (using a predefined backup instead of a fresh one) or a problem with the predefined backup itself, which is not part of the test. @juliayakovlev can we close this issue?

The file was created over six months ago as part of another test run. Would that be a problem?

ShlomiBalalis avatar Feb 05 '24 16:02 ShlomiBalalis
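
For completeness, a hedged way to confirm that the bucket really has no lifecycle rules, assuming AWS CLI access (the call returns a NoSuchLifecycleConfiguration error when none are configured):

$ aws s3api get-bucket-lifecycle-configuration \
    --bucket manager-backup-tests-permanent-snapshots-us-east-1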

The file was created over six months ago as part of another test run. Would that be a problem?

SM should have no problem with restoring old backups.

Michal-Leszczynski avatar Feb 12 '24 13:02 Michal-Leszczynski

@ShlomiBalalis any news? It continues to fail.

juliayakovlev avatar Feb 15 '24 08:02 juliayakovlev

@ShlomiBalalis ping

Michal-Leszczynski avatar Mar 01 '24 10:03 Michal-Leszczynski

@mikliapko is this something that you could take care of? I mean validating whether this is a problem with an incomplete, cached backup or an actual issue.

Michal-Leszczynski avatar Mar 28 '24 11:03 Michal-Leszczynski
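
One possible way to do that validation, sketched under assumptions: manifest.json below is a hypothetical local copy of the decompressed SM manifest for snapshot sm_20230702235739UTC, and the jq filter assumes the manifest lists SSTable file names in "files" arrays, which may differ between SM versions.

# Extract SSTable names referenced by the manifest (assumed schema)
$ jq -r '.. | .files? // empty | .[]' manifest.json | sort -u > manifest-files.txt

# Collect the file names actually present under the backup's sst prefix
$ aws s3 ls --recursive s3://manager-backup-tests-permanent-snapshots-us-east-1/backup/sst/ \
    | awk '{print $NF}' | awk -F/ '{print $NF}' | sort -u > bucket-files.txt

# Files referenced by the manifest but absent from the bucket
$ comm -23 manifest-files.txt bucket-files.txt

A non-empty diff would indicate the predefined backup itself is incomplete; an empty diff would point back at the restore path or the test configuration.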