restic icon indicating copy to clipboard operation
restic copied to clipboard

Add repair command

Open aawsome opened this issue 4 years ago • 18 comments

What does this PR change? What problem does it solve?

Allow users to recover from broken repositories/snapshots while still salvaging the sane parts of the repository/snapshot.

For given snapshots (selection identical to, e.g., forget) the command tries to read all trees and checks if the needed blobs are contained in the index. If blobs are missing or trees cannot be read, it will create new trees and snapshots which only miss these "defect" parts. Those newly generated snapshots can be used to recover needed data. Also, after removing the "defect" snapshots, prune is able to clean up the repo again.

While this command is able to cause data loss, special care is taken such that the default flags won't do any harm - in fact, users have to explicitly specify --dry-run=false --delete to loose data.

Output looks like:

./restic -r /home/thinkpad/repo.index-missingblob2/repo repair 

 note: --dry-run is set
-> repair will only show what it would do.

enter password for repository: 
repository b270637f opened successfully, password is correct
check and repair 1 snapshots
<Snapshot f22c6d3a of [/home/thinkpad/data] at 2020-07-09 11:18:26.501071439 +0200 CEST by alex@thinkpad>:
removed defect file '/home/thinkpad/data/test'
would have modified tree 705adc0d
would have modified tree 1ee0d0e1
would have modified tree 98c948be
would have modified tree 9d89d7fe
would have repaired snpshot f22c6d3a.
[0:00] 100.00%  1 / 1 snapshots

Depends on #2878 for the troubleshooting docu update.

Was the change discussed in an issue or in the forum before?

Closes #1759 Closes #1798 Closes #2334

Checklist

  • [x] I have read the Contribution Guidelines
  • [x] I have enabled maintainer edits for this PR
  • [ ] I have added tests for all changes in this PR
  • [x] I have added documentation for the changes (in the manual)
  • [x] There's a new file in changelog/unreleased/ that describes the changes for our users (template here)
  • [x] I have run gofmt on the code in all commits
  • [x] All commit messages are formatted in the same style as the other commits in the repo
  • [x] I'm done, this Pull Request is ready for review

aawsome avatar Aug 05 '20 19:08 aawsome

Could we please discuss first how such a command should look like? How does this relate to the rewrite command in #2731 ? What if the prune command could handle certain types of repository damages? Which types of repair commands for a repository do we need and how to avoid cluttering the CLI?

That said, with the current flood of new pull requests which is growing much faster than we can handle them and the still huge pile of older ones waiting for a review, it will quite likely take some time until anyone of the maintainers has time to participate in that discussion.

This PR does not fix the underlying issue of #2334, it merely removes the symptoms (the broken file). But it's still not possible to restore at least the remaining parts of a file.

MichaelEischer avatar Aug 08 '20 15:08 MichaelEischer

I thought there would be some changes appearing in near future on either rewrite or prune, so I waited with an answer.

So, basically the point is that restic is missing such kind of functionality and IMO this is critical as cases has already been reported where users are struggling with broken repos which can only be fixed by forgetting and pruning whole snapshots. This is why I proposed this PR.

About the name and where to put such functionality we can and should of course discuss and I'm fine to have the discussion here. I agree that the functionality is pretty similar to the rewrite command proposed in #2731 and maybe adding some flags there would solve the issue similarly. I however don't think that this functionality should be included into prune: prune removes data and should start only if the repository is sane. This functionality is proposed to help users the possibly best way if their repository is no sane to put it back into a sane state. For now, we already do have a recover command which somehow stands beside the other restic commands and seems to be a "quick" solution for a problem that occured. So I used a new command in this "tradition".

I must admit that there is another reason why I used a new command repair and it is the situation you depicted, namely that I must take into account that it might take a long time before this PR is getting discussed or merged. In this situation, I really do want to have a patch that does not depend on another WIP patch (i.e. #2731) or is expected to need rebasing after some other PRs get merged into master. This is a patch that is able to help users who need a quick solution for a broken repo and therefore should stay easily mergeable. This can be best guaranteed by only adding new files to the source code.

This PR does not fix the underlying issue of #2334, it merely removes the symptoms (the broken file). But it's > still not possible to restore at least the remaining parts of a file.

If --append is given (defaults to ".repaired") this repair command will generate a " file.repaired" which contains the remaining parts of that file. Only if no parts are left to repair the file, it will delete the file (as shown in the example output where /home/thinkpad/data/test was a 1-blob file and that blob was missing in the repo).

aawsome avatar Aug 27 '20 21:08 aawsome

So, basically the point is that restic is missing such kind of functionality and IMO this is critical as cases has already been reported where users are struggling with broken repos which can only be fixed by forgetting and pruning whole snapshots. This is why I proposed this PR.

It's definitely useful to have a WIP PR to point people to to repair damaged repositories.

I agree that the functionality is pretty similar to the rewrite command proposed in #2731 and maybe adding some flags there would solve the issue similarly.

From an implementation perspective both commands are pretty similar so it would be useful to make the recursive tree rewriting part reusable for both commands. (Assuming we end up with two separate commands)

I however don't think that this functionality should be included into prune

With "handle certain types of repository damages" I had in mind that it should be possible to let the prune command tolerate (= be able to remove unused blobs/packs) repositories which e.g. miss a few data blobs. Of course, the prune command shouldn't start to modify snapshots.

If --append is given (defaults to ".repaired") this repair command will generate a " file.repaired" which contains the remaining parts of that file. Only if no parts are left to repair the file, it will delete the file (as shown in the example output where /home/thinkpad/data/test was a 1-blob file and that blob was missing in the repo).

I've missed the --append option and yes it helps a lot if blob sizes are missing, but I'd still consider this as fixing the symptoms. The underlying problem I had in mind is the repository format itself: it can't store the size of missing blobs without risking further damaged snapshots, but when rebuilding the index to fix this we loose the information about the position of the data blobs.

MichaelEischer avatar Aug 28 '20 20:08 MichaelEischer

With "handle certain types of repository damages" I had in mind that it should be possible to let the prune command tolerate (= be able to remove unused blobs/packs) repositories which e.g. miss a few data blobs. Of course, the prune command shouldn't start to modify snapshots.

You are right - in general missing data blobs are not a that big issue when thinking about the sanity of the repository. Of course, data might be lost which might be a bad thing for users. But starting with a repository where you have missing data blobs, probably the best is actually to remove them from the index but let the trees still reference them. If another backup run will stumble over the missing data, the repository will be healed. And if not, all functionality is working perfectly except if you need to access that specific data, i.e. a file which contains a missing blob. (and even restore and mount could work around this and include the functionality I propose here for missing blobs)

Another issue is missing tree blobs. This severely damages the repository as most commands like check and prune will no longer work. For this kind of damage IMO it is best to modify trees and snapshots to remove these damaged subtrees which is a much better solution than complete delete the snapshot.

I've missed the --append option and yes it helps a lot if blob sizes are missing, but I'd still consider this as fixing the symptoms. The underlying problem I had in mind is the repository format itself: it can't store the size of missing blobs without risking further damaged snapshots, but when rebuilding the index to fix this we loose the information about the position of the data blobs.

I recently thought about a check when loading the index whether the pack files referenced in the index still exist in the backend. This would allow users to just remove damaged packs (missing packs would be already detected) and then an information in the index could be saved whether the blob is present or only the length saved in the index may be used. I rejected to implement this as a regular check as it would mean that every 'LoadIndex' would need to 'List' the pack files which could be expensive. Maybe that might be a direction to think of?

aawsome avatar Aug 28 '20 21:08 aawsome

I recently thought about a check when loading the index whether the pack files referenced in the index still exist in the backend. This would allow users to just remove damaged packs (missing packs would be already detected) and then an information in the index could be saved whether the blob is present or only the length saved in the index may be used. I rejected to implement this as a regular check as it would mean that every 'LoadIndex' would need to 'List' the pack files which could be expensive. Maybe that might be a direction to think of?

I don't think we want to list all pack files every time the repository index is loaded, it really sounds (needlessly) expensive. It would also only be a partial workaround, as having to rebuild the repository index would still loose the information. And we'd also get into a lot of trouble when mixing older and newer restic version which have differing opinions on which blobs actually exist and which don't. The only way to properly fix this is to handle this in a repository format v2 #628 .

MichaelEischer avatar Aug 29 '20 09:08 MichaelEischer

For the second time I have a corrupted repo - too bad. Probably the reason was that the backup disk was full. I tried ./restic -r$REPO repair and it proposes fixes, as expected.

However, when I run 'make test' I get ? github.com/restic/restic/internal/limiter [no test files] ? github.com/restic/restic/internal/migrations [no test files] ? github.com/restic/restic/internal/mock [no test files] --- FAIL: TestOptionsApplyInvalid (0.00s) --- FAIL: TestOptionsApplyInvalid/test-2 (0.00s) options_test.go:216: expected error "time: missing unit in duration 2134", got "time: missing unit in duration "2134"" FAIL FAIL github.com/restic/restic/internal/options 0.014s ok github.com/restic/restic/internal/pack 0.018s

other tests are o.k.

Tested on ARCH 5.8.7-arch1-1

pddp avatar Sep 14 '20 07:09 pddp

Hi @pddp thanks for reporting the issue!

FAIL github.com/restic/restic/internal/options 0.014s

Actually, this PR doesn't change anything on /internal/options. Did you redo your test with master? Seems to me that this might be an uncorrelated issue..

aawsome avatar Sep 14 '20 14:09 aawsome

Hi @aawsome

your guess was correct - this was uncorrelated. Also happens on master.

pddp avatar Sep 14 '20 16:09 pddp

My repo seems to be corrupted after I ran rebuild-index. Is it a bad idea to check out this branch and invoke the repair command?

huyz avatar Feb 16 '21 15:02 huyz

I also rebased to current master and resolved all typos in cmd_repair.go - thanks @huyz!

I'll keep the troubleshooting.rst unchanged, is is mainly a topic of #2878 (I think this will be removed from this PR once this is reviewed)

aawsome avatar Feb 20 '21 20:02 aawsome

I re-rain the repair command and it cleared up all my errors. Thanks!

huyz avatar Feb 22 '21 11:02 huyz

Any updates on this? the command has saved me a couple of times and it would be neat to have it mainlined :)

kradalby avatar May 16 '21 10:05 kradalby

This works for me too, I repaired three B2 repos. Unfortunately, I had some instances of "the root tree is damaged -> delete snapshot". A cursory investigation showed that these snapshots where not recoverable. I didn't do a deep-dive (I have alternative backups, and these were older snapshots), and just got rid of the errand snapshots.

noduck avatar May 23 '21 07:05 noduck

Just another success story to add. Deleted some files by mistake from a repository with 27 snapshots. Neither prune nor rebuild-index was able to do any help, also I couldn't find anything convenient in the options. Didn't try recover since my snapshots folder did not have any losses. Without repair, I had to trace all missing packs and forget related snapshots I assume. I tried this one as:

  • Ran backup with --force option on repo's 3 clients
  • Ran rebuild-index on the repo
  • Ran repair --dry-run=false --delete-snapshots
  • Ran check --read-data (succeeded) Ended up with 22 snapshots, 20 of them has the tag repaired

Thanks for this @aawsome

seqizz avatar Feb 01 '22 07:02 seqizz

I also used the command successfully. In my setup I used OneDrive over rclone as target and I had about 160GB and 14 snapshots. Before I executed the repair I was not able to run prune because it was missing a blob. I already found out that the file contained was just some cache file I was not interested but it was present in all snapshots.

The command took a long time to execute (about 45 hours) but was able to keep all snapshots and prune ran fine afterwards.

I really think this should be merged. There is no other reasonable possibility to repair in such cases. That doesn't involve command line hacking and examining and running debug commands with your own build etc.

Here the steps I took to run it:

git clone https://github.com/aawsome/restic.git
cd restic
git checkout new-repair-command
make
./restic -r /myrepo repair 
./restic -r /myrepo repair --dry-run=false --delete-snapshots

patrickuhlmann avatar Apr 18 '22 21:04 patrickuhlmann

I've rebased the PR such that it includes support for repository format version 2 repositories.

MichaelEischer avatar Jun 17 '22 15:06 MichaelEischer

Very useful PR. It help me repair my 10Ti+ restic backup with 100+ snapshots. After repaired, I can force backup the corrupted files.

jim3ma avatar Jul 20 '22 10:07 jim3ma

I recently tried this branch as well and had a special case: When pruning I got the error that a packfiles hash does not match the expected hash. So I ran the repair command, but it didn't show anything to repair. So I just fell back to the alternative route: Deleting the problematic packfiles. Which - to my surprise - yielded error-free check and prune runs. I'm not sure if it's worth handling that case and if so, if the repair command should be extended or the prune command should be smarter (not erroring on unneeded packs).

JaCoB1123 avatar Jul 20 '22 20:07 JaCoB1123

@aawsome Any updates ? Some actions failed, can we fix it ?

jim3ma avatar Oct 10 '22 03:10 jim3ma

@aawsome Any updates ? Some actions failed, can we fix it ?

Yes, seems that some Windows tests are failing. I am speculating as there are no logs available any longer, but there must be some incompatible changes in master.

My priority, however, ATM is to improve rustic, where this functionality is currently in development (see https://github.com/rustic-rs/rustic/pull/258) and I'm pretty sure it will be released in near future. Note that you can use rustic to access and modify a restic repository.

aawsome avatar Oct 13 '22 13:10 aawsome

Just wanted to point out that the --no-lock switch seems to not work with repair. I have a read-only repository and I was just curious what a dry run of repair would do, but it insisted on creating the lock despite --no-lock and wouldn't work. Might be by design, and I'd understand if so - but I thought I'd mention it!

akrabu avatar Nov 10 '22 19:11 akrabu

@akrabu You are right. There is no special treatment for --dry-run with respect to locking and the standard is to lock the repo in restic.

But you can try out rustic repair snapshots --dry-run (from rustic main branch) on your repository - rustic works completely lock-free and never creates a lock file.

aawsome avatar Nov 10 '22 22:11 aawsome

I've updated and massively refactored the PR:

  • the command has moved to repair snapshots
  • rebuild-index has moved to repair index
  • repair snapshots now has largely the same options as rewrite. The CLI output is also rather similar. In fact, both commands now share significant parts of their implementation.
  • rewritten troubleshooting section. It is now more of a step by step guide. I might have overdone it with encouraging users to ask for help if necessary...

Repaired files are currently not suffixed as this could break the invariant the tree nodes must be sorted in lexicographical order. Like the rewrite command, --dry-run also works without requesting a lock.

Current todos:

  • [x] Check the code / documentation again
  • [x] Update changelog
  • [x] Add basic tests for the repair snapshots command
  • [x] Figure out whether we want to split the repair index and rewrite refactoring into a separate PR.

MichaelEischer avatar Dec 28 '22 14:12 MichaelEischer

LGTM.

MichaelEischer avatar May 05 '23 21:05 MichaelEischer

Just wanted to say that this PR saved me a lot of work. I had a corrupted repository, I couldn't figure out the issue, and every snapshot seemed to be corrupt missing at least one blob. By rebuilding restic from source to include this PR, I was able to fix the repository very easily, and have now recovered about 95% of the original files.

Thank you so much @aawsome and all others involved! This is evidently a very valuable feature for restic, it is greatly appreciated.

DevJake avatar May 11 '23 19:05 DevJake