
check -files optimization?

Open Ossssip opened this issue 7 years ago • 6 comments

Currently, if duplicacy is given the check -files command, it will download chunks and compute file hashes in memory, to make sure that all hashes match.

My impression (just from execution times) is that this is done completely independently for each snapshot, i.e. if I have two snapshots with exactly the same files (or with just a few new chunks), it will download and check the entire chunk set for each snapshot. Am I right? I am wondering whether, when checking several snapshots, it would be possible to check only the altered pieces of the backup for each following snapshot. That would reduce the data transfer overhead and execution time significantly.

Ossssip avatar Nov 17 '17 18:11 Ossssip

Right, the -files option checks snapshots independently and doesn't skip files that have already been verified. It shouldn't be hard to create a map to store verified files and skip any file that can be found in this map.
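
A minimal sketch of that idea, using simplified stand-in types rather than Duplicacy's real internals (File, Snapshot, and downloadAndVerify here are all hypothetical):

```go
package main

import "log"

// Hypothetical, simplified stand-ins for Duplicacy's snapshot and
// file entries; not the project's actual types.
type File struct {
	Path string
	Hash string // content hash recorded in the snapshot
}

type Snapshot struct {
	Files []File
}

// downloadAndVerify stands in for fetching a file's chunks and
// recomputing its hash; assumed to return an error on a mismatch.
func downloadAndVerify(f File) error { return nil }

// checkSnapshots verifies each distinct file content only once:
// identical content seen again in a later snapshot is skipped
// instead of being downloaded and hashed a second time.
func checkSnapshots(snapshots []Snapshot) {
	verified := make(map[string]bool)
	for _, s := range snapshots {
		for _, f := range s.Files {
			if verified[f.Hash] {
				continue // same content already checked in an earlier snapshot
			}
			if err := downloadAndVerify(f); err != nil {
				log.Printf("%s failed verification: %v", f.Path, err)
				continue
			}
			verified[f.Hash] = true
		}
	}
}

func main() {
	checkSnapshots(nil)
}
```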

I also think we need a -chunks option which would verify each chunk rather than individual files.
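
Conceptually, such an option would re-hash each downloaded chunk and compare the digest with the ID the chunk is stored under. A rough sketch, assuming chunk IDs are plain hex-encoded SHA-256 of the content (Duplicacy actually derives chunk IDs with a keyed hash, so this is illustrative only):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// verifyChunk re-hashes a downloaded chunk and compares the digest
// with the ID under which the chunk is stored. The plain SHA-256 ID
// scheme here is an assumption for illustration.
func verifyChunk(id string, content []byte) error {
	sum := sha256.Sum256(content)
	if hex.EncodeToString(sum[:]) != id {
		return fmt.Errorf("chunk %s is corrupted", id)
	}
	return nil
}

func main() {
	data := []byte("example chunk content")
	sum := sha256.Sum256(data)
	fmt.Println(verifyChunk(hex.EncodeToString(sum[:]), data)) // <nil>
}
```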

gilbertchen avatar Nov 18 '17 02:11 gilbertchen

The ability to run partial checks with the -files option that resume where the previous check left off would also be handy. This would be useful for large backup sets on remote destinations, where a full check could then be split into smaller jobs over several weeks/months.

Perhaps something similar to what HashBackup does, though I'm not sure how this would interact with old revisions/chunks being pruned between checks.

Incremental checking, for example, --inc 1d/30d, means that selftest is run every day, perhaps via a cron job, and the entire backup should be checked over a 30-day period. Each day, 1/30th of the backup is checked. The -v3, -v4, and -v5 options control the check level, and each level has its own incremental check schedule. For huge backups, it may be necessary to spread checking out over a quarter or even longer. The schedule can be changed at any time by using a different time specification.

You may also specify a download limit with incremental checking. For example, --inc 1d/30d,500MB means to limit the check to 500MB of backup data. This is useful with -v4 when cache-size-limit is set. In this case, archives may have to be downloaded. Many storage services have free daily allowances for downloads, but charge for going over them. Adding a download limit ensures that incremental selftest doesn't go over the free allowance. The download limit is always honored, even if it causes a complete cycle to take longer than specified.
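
Translated to this issue, an incremental check would walk a stable ordering of all chunks and verify a bounded slice per run. A hypothetical sketch of just the selection logic (selectForToday and its parameters are made up for illustration; --inc is HashBackup's flag, not Duplicacy's):

```go
package main

import "fmt"

type Chunk struct {
	ID   string
	Size int64
}

// selectForToday returns the chunks to verify on day `day` of a
// `cycleDays`-day cycle, stopping early once `byteLimit` bytes would
// be exceeded. Assumes `chunks` keeps a stable order across runs.
func selectForToday(chunks []Chunk, day, cycleDays int, byteLimit int64) []Chunk {
	per := (len(chunks) + cycleDays - 1) / cycleDays // chunks per day, rounded up
	start := day * per
	if start >= len(chunks) {
		return nil
	}
	end := start + per
	if end > len(chunks) {
		end = len(chunks)
	}
	var picked []Chunk
	var total int64
	for _, c := range chunks[start:end] {
		if total+c.Size > byteLimit {
			break // honor the download cap even if the cycle stretches out
		}
		picked = append(picked, c)
		total += c.Size
	}
	return picked
}

func main() {
	chunks := []Chunk{{"a", 300 << 20}, {"b", 300 << 20}, {"c", 100 << 20}}
	fmt.Println(len(selectForToday(chunks, 0, 1, 500<<20))) // 1: "b" would exceed the 500MB cap
}
```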

thrnz avatar Nov 19 '17 08:11 thrnz

Duplicate: https://github.com/gilbertchen/duplicacy/issues/477

TheBestPessimist avatar Jun 12 '19 04:06 TheBestPessimist

Is there any progress on this? The repeated download of the same chunks is very time-consuming over slower Internet connections.

riobard avatar Apr 18 '20 09:04 riobard

And how is the new -chunks option (https://github.com/gilbertchen/duplicacy/commit/22d6f3abfc61c12d6c2d518f5bdd7e0f346c5a10) different from this one? Is -chunks a replacement? Or can -files find some kind of corruption that -chunks can't? Thanks

mr-flibble avatar Apr 18 '20 12:04 mr-flibble


Update: now I finally understand it. From https://forum.duplicacy.com/t/cli-release-2-5-0/3448/3, gchen:

-files checks the integrity of each file – it is basically the same as a full restore operation without writing files to the local disk. -chunks only checks the integrity of each chunk. It is possible that -chunks reports no errors but some files can’t be restored due to a bug in Duplicacy’s code or memory corruption that happens in the backup operation.
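
In other words, every chunk digest can verify while the snapshot's file-to-chunk mapping is wrong, and only a file-level check catches that. A contrived sketch of the failure mode, with simplified stand-in types:

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
)

// Simplified stand-in: a file is a sequence of chunk references plus
// the whole-file hash recorded at backup time.
type FileEntry struct {
	ChunkIDs []int
	FileHash [32]byte
}

func main() {
	chunks := [][]byte{[]byte("hello "), []byte("world")}

	// Backup records the hash of the assembled file.
	entry := FileEntry{
		ChunkIDs: []int{0, 1},
		FileHash: sha256.Sum256([]byte("hello world")),
	}

	// Simulate a metadata bug: the chunk order is recorded wrongly.
	entry.ChunkIDs = []int{1, 0}

	// A -chunks-style check: each chunk's content still hashes
	// correctly, so a chunk-level check reports no errors here.

	// A -files-style check: reassemble and compare the file hash.
	assembled := bytes.Join([][]byte{chunks[entry.ChunkIDs[0]], chunks[entry.ChunkIDs[1]]}, nil)
	fmt.Println(sha256.Sum256(assembled) == entry.FileHash) // false: only the file-level check catches this
}
```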

mr-flibble avatar May 11 '20 07:05 mr-flibble