
Restore using BlobStore looks in the wrong place

esbenbach opened this issue Nov 17 '17 · 7 comments

I tried using the BlobStore implementation for backup/restore of an Actor service.

I created a backup using BeginCreateBackup(Microsoft.ServiceFabric.Data.BackupOption.Full) and the backup appears in the blob storage account under servicefabricbackups/root/{PARTITIONID}, as expected.

Then I asked my SF cluster to simulate a partial data loss: Start-ServiceFabricPartitionDataLoss -DataLossMode PartialDataLoss -PartitionId 637499bf-efc1-4ead-90c6-5edcfddb82a1 -ServiceName "fabric:/MyService/MyActor", which invokes the recovery procedure for the actor in order to restore it to a proper state.

Now the RetrieveScheduledBackupAsync(Guid servicePartitionId) method seems to be executed with the correct partition id. However, the code attempts to retrieve a BlockBlobReference at servicefabricbackups/root/Queue/{servicePartitionId}/. That path does not exist, and if I remove the "Queue" segment the path still points to a directory rather than a block blob, so the blob cannot be fetched. Reading the code, I am unable to figure out what it should be doing (it seems to be looking for a Guid of sorts).

It seems like the BlobStore implementation is not really working, or alternatively I am doing something horribly wrong.

esbenbach avatar Nov 17 '17 13:11 esbenbach

So I dug a bit further and found that the Queue blob seems to store the backup id of a manually queued backup. In this case, since we are talking about SF trying to recover itself, no such queued backup exists.

I changed the code a bit to get the latest backup metadata for the given service partition (based on the timestamp). This probably breaks existing behaviour quite a bit, but it works in the sense that a backup metadata object is returned.
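Roughly, the change I made looks like this (a sketch only; the store-interface method name GetBackupMetadataAsync and its parameter shape are my assumptions, not necessarily the library's exact API):

```csharp
// Sketch of the workaround described above (not the library's actual code):
// instead of resolving a queued backup id from the Queue blob, take the newest
// backup metadata for this partition by timestamp.
// Assumption: the central store exposes a GetBackupMetadataAsync query.
var metadataForPartition = await centralBackupStore.GetBackupMetadataAsync(
    servicePartitionId: servicePartitionId);

// Pick the most recent backup for the partition.
var latest = metadataForPartition
    .OrderByDescending(m => m.TimeStampUtc)
    .FirstOrDefault();

return latest; // returned where a queued backup id would normally be resolved
```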

Now the next issue: once it has the metadata, it tries to download the backup using the following code in BackupRestoreServiceOperations:

```csharp
string localBackupFolder = Path.Combine(
    service.Context.CodePackageActivationContext.WorkDirectory,
    Guid.NewGuid().ToString("N"));

foreach (var backupMetadata in backupList)
{
    string subFolder = Path.Combine(localBackupFolder, backupMetadata.TimeStampUtc.ToString("yyyyMMddhhmmss"));
    await service.CentralBackupStore.DownloadBackupFolderAsync(backupMetadata.BackupId, subFolder, cancellationToken);
}
```

So it places everything into a subfolder for each backup, then it proceeds to restore said backup using:

```csharp
var restoreDescription = new RestoreDescription(localBackupFolder, RestorePolicy.Force);
await restoreCtx.RestoreAsync(restoreDescription, cancellationToken);
```

However, localBackupFolder then only contains a timestamp subfolder, which RestoreAsync does not recognize, and therefore the restore fails.

If instead I change the "download" to place the files in the localBackupFolder directly, it restores as it should.
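A sketch of that change, reusing the identifiers from the snippets above (only the target folder differs):

```csharp
// Download each backup's files straight into localBackupFolder, so that
// RestoreAsync sees the layout it expects. This only works for full backups,
// where there is a single backup to download.
foreach (var backupMetadata in backupList)
{
    await service.CentralBackupStore.DownloadBackupFolderAsync(
        backupMetadata.BackupId, localBackupFolder, cancellationToken);
}

// Restore from the flat folder instead of a per-backup subfolder.
var restoreDescription = new RestoreDescription(localBackupFolder, RestorePolicy.Force);
await restoreCtx.RestoreAsync(restoreDescription, cancellationToken);
```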

I'm guessing the subfolder logic relates to incremental backups; I'm just not sure how that works, since for a full backup there is only ever going to be one backup folder.

Could you shed some light on how this works? Apparently it works for the file store implementation (though I haven't dug into the details there).

esbenbach avatar Nov 20 '17 09:11 esbenbach

The way it works is like this:

  1. Create a Full back-up. This backup will get a unique ID.
  2. Optionally, create one or more Partial back-ups. All of these will get an ID as well.
  3. When you need to restore a backup, call BeginRestoreBackup on the target partition (replica) with info about the backup to restore, as its argument.
  4. What happens next, is that this backup is placed in a Queue (blob), because you cannot pass arguments when restoring a backup.
  5. During the data-loss state inside the service, it looks for a queued backup identifier for its partition. If it finds one, it restores that back-up.

I believe that what you're missing is the enqueue step. I suppose a feature could be added that automatically restores the last backup during data loss; at this time there isn't one.
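To make the flow above concrete, the enqueue step (steps 3 and 4) looks roughly like this (a sketch; BeginRestoreBackup is named above, but the proxy setup and the exact argument shape are assumptions):

```csharp
// Hypothetical call site for step 3: ask the target partition to queue a
// specific backup for restore. This writes the backup id to the Queue blob,
// which the service reads back during the next data-loss callback (step 5).
await serviceProxy.BeginRestoreBackup(backupMetadata);
```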

Regarding multiple folders: when restoring a partial backup, you need all previous backup folders, all the way up to (and including) the latest full backup.
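So for a restore of the latest partial backup, the local folder handed to RestoreDescription would contain one subfolder per backup in the chain, for example (illustrative timestamps only):

```
{WorkDirectory}/{new-guid}/
    20171101120000/   <- full backup
    20171108120000/   <- partial backup
    20171115120000/   <- partial backup (latest)
```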

loekd avatar Nov 20 '17 19:11 loekd

Spot on. I intentionally do not have the enqueue step because the restore operation is kicked off by Service Fabric on its own whenever a partial data loss occurs (I assume this can happen if a node crashes during replication or some such thing).

So basically what I am attempting is not supported. Got it :)

I am probably going to have to develop this type of support on my own. Would you be welcoming pull requests related to such a feature, or would I be better off just forking and making my "own thing"?

esbenbach avatar Nov 21 '17 06:11 esbenbach

Always great to get good pull requests. Be advised though, that the SF team is working on a Backup-Application that will likely make this package obsolete in the future.

loekd avatar Nov 21 '17 10:11 loekd

@loekd Nice to hear that the SF team is working on this. Do you by chance know if the SF team has discussed that feature or provided any updates regarding its progress anywhere online that I can read up on? Did a quick search, but couldn't find anything that seemed to discuss it.

andrewdmoreno avatar Dec 14 '17 22:12 andrewdmoreno

@andrewdmoreno check out this video from ignite: https://www.youtube.com/watch?v=EO4BUkgHaD8 around 42 minutes.

loekd avatar Dec 18 '17 07:12 loekd

@andrewdmoreno The preview will be released early next year: https://github.com/Azure/service-fabric-issues/issues/730

ThiemeNL avatar Dec 27 '17 13:12 ThiemeNL