check -files optimization?
Currently, if duplicacy is given the check -files command, it will download chunks and compute file hashes in memory to make sure that all hashes match.
My impression (just from execution times) is that this is done completely independently for each snapshot, i.e. if I have two snapshots with exactly the same files (or with just a few new chunks), it will download and check the entire chunk set for each snapshot. Am I right? I am wondering whether, when checking several snapshots, it would be possible to check only the altered pieces of the backup for each subsequent snapshot. That would reduce the data transfer overhead and execution time significantly.
Right, the -files option checks snapshots independently and doesn't skip files that have already been verified. It shouldn't be hard to create a map to store verified files and skip a file if it can be found in this map.
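To illustrate the idea, here is a minimal Go sketch of such a verified-file map. The type and function names are made up for illustration; this is not Duplicacy's actual code:

```go
package main

import "fmt"

// Hypothetical sketch of the "map of verified files" idea; not Duplicacy's
// actual implementation.

// verified records file hashes that have already been checked, so a later
// snapshot containing the same file can skip it.
var verified = map[string]bool{}

// checkFile verifies a file unless a file with the same hash was already
// verified in an earlier snapshot.
func checkFile(path, hash string) {
	if verified[hash] {
		fmt.Printf("skip %s (hash already verified)\n", path)
		return
	}
	// The real command would download the file's chunks here, recompute the
	// file hash, and compare it with the expected value.
	verified[hash] = true
	fmt.Printf("verified %s\n", path)
}

func main() {
	// Two snapshots sharing the same file: the second check is skipped.
	checkFile("snap-1/docs/report.pdf", "abc123")
	checkFile("snap-2/docs/report.pdf", "abc123")
}
```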
I also think we need a -chunks option which would verify each chunk rather than individual files.
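As a rough illustration, a -chunks style pass could simply recompute each downloaded chunk's hash and compare it with the chunk ID, without reconstructing any files. The sketch below uses plain SHA-256 and made-up helper names; Duplicacy's real chunk hashing works differently:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// verifyChunk recomputes the hash of a downloaded chunk and compares it with
// the expected chunk ID. Plain SHA-256 is used here only for illustration.
func verifyChunk(id string, data []byte) bool {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:]) == id
}

func main() {
	data := []byte("chunk payload")
	sum := sha256.Sum256(data)
	id := hex.EncodeToString(sum[:])
	fmt.Println("chunk ok:", verifyChunk(id, data))
}
```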
The ability to run partial checks with the -files option that resume where the previous check left off would also be handy. This would be useful for large backup sets on remote destinations, where a full check could then be split into smaller jobs over several weeks or months.
Perhaps something similar to what HashBackup does, though I'm not sure how this would work with old revisions/chunks being pruned etc. between checks.
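One way such a resumable check could work (purely a sketch, not an existing Duplicacy feature) is to persist the set of already-verified chunk IDs to a small state file and skip them on the next run; chunks pruned between checks would simply drop out of the list of referenced chunks:

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// Assumed location of the progress file; purely illustrative.
const stateFile = "check-state.json"

// loadVerified reads the set of chunk IDs verified by previous runs.
func loadVerified() map[string]bool {
	verified := map[string]bool{}
	if data, err := os.ReadFile(stateFile); err == nil {
		json.Unmarshal(data, &verified)
	}
	return verified
}

// saveVerified checkpoints the current progress so the next run can resume.
func saveVerified(verified map[string]bool) {
	data, _ := json.Marshal(verified)
	os.WriteFile(stateFile, data, 0o644)
}

func main() {
	// Chunk IDs referenced by the snapshots being checked; chunks pruned
	// since the last run would no longer appear in this list.
	chunks := []string{"chunk-a", "chunk-b", "chunk-c"}

	verified := loadVerified()
	for _, id := range chunks {
		if verified[id] {
			continue // already verified in a previous run
		}
		// The real check would download the chunk and verify its hash here.
		verified[id] = true
		saveVerified(verified)
		fmt.Println("verified", id)
	}
}
```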
Incremental checking, for example, --inc 1d/30d, means that selftest is run every day, perhaps via a cron job, and the entire backup should be checked over a 30-day period. Each day, 1/30th of the backup is checked. The -v3, -v4, and -v5 options control the check level, and each level has its own incremental check schedule. For huge backups, it may be necessary to spread checking out over a quarter or even longer. The schedule can be changed at any time by using a different time specification.
You may also specify a download limit with incremental checking. For example, --inc 1d/30d,500MB means to limit the check to 500MB of backup data. This is useful with -v4 when cache-size-limit is set, because archives may then have to be downloaded. Many storage services have free daily allowances for downloads but charge for exceeding them. Adding a download limit ensures that an incremental selftest doesn't go over the free allowance. The download limit is always honored, even if it causes a complete cycle to take longer than specified.
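For a sense of how such a schedule could be interpreted, the sketch below spreads the chunks over the period and stops once the download budget is reached. The names and the selection scheme are illustrative, not how HashBackup actually implements it:

```go
package main

import "fmt"

// chunk is a stand-in for a chunk's ID and size in bytes.
type chunk struct {
	id   string
	size int64
}

// dailySlice picks roughly 1/period of the chunks for a given day and stops
// once the download budget (in bytes) would be exceeded.
func dailySlice(chunks []chunk, day, period int, budget int64) []chunk {
	var picked []chunk
	var downloaded int64
	for i, c := range chunks {
		if i%period != day%period {
			continue // this chunk belongs to another day of the cycle
		}
		if downloaded+c.size > budget {
			break // honor the limit, even if the full cycle then takes longer
		}
		downloaded += c.size
		picked = append(picked, c)
	}
	return picked
}

func main() {
	chunks := []chunk{{"a", 200 << 20}, {"b", 200 << 20}, {"c", 200 << 20}}
	// Day 0 of a 30-day cycle with a 500 MB download budget.
	for _, c := range dailySlice(chunks, 0, 30, 500<<20) {
		fmt.Println("check", c.id)
	}
}
```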
Duplicate: https://github.com/gilbertchen/duplicacy/issues/477
Is there any progress on this? The repeated download of the same chunks is very time-consuming over slower Internet connections.
And how is the new -chunks option (https://github.com/gilbertchen/duplicacy/commit/22d6f3abfc61c12d6c2d518f5bdd7e0f346c5a10) different from this one? Is -chunks a replacement? Or can -files find some kind of corruption that -chunks can't? Thanks
Update: now I finally understand it (https://forum.duplicacy.com/t/cli-release-2-5-0/3448/3), quoting gchen:
-files checks the integrity of each file – it is basically the same as a full restore operation without writing files to the local disk. -chunks only checks the integrity of each chunk. It is possible that -chunks reports no errors but some files can’t be restored due to a bug in Duplicacy’s code or memory corruption that happens in the backup operation.
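A small example of the difference: even if every chunk verifies on its own, the file still has to be reassembled from the right chunks in the right order, and only a file-level hash check catches a mistake there. The sketch below uses plain SHA-256 for illustration; Duplicacy's actual hashing differs:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// hashOf returns a hex SHA-256 digest (illustrative only).
func hashOf(data []byte) string {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])
}

func main() {
	// Metadata recorded at backup time: per-chunk hashes and the file hash.
	chunks := [][]byte{[]byte("hello "), []byte("world")}
	chunkHashes := []string{hashOf(chunks[0]), hashOf(chunks[1])}
	fileHash := hashOf([]byte("hello world"))

	// Simulate a restore bug that assembles the chunks in the wrong order.
	assembled := append(append([]byte{}, chunks[1]...), chunks[0]...)

	// A -chunks style check still passes: every chunk matches its own hash.
	for i, c := range chunks {
		fmt.Printf("chunk %d ok: %v\n", i, hashOf(c) == chunkHashes[i])
	}

	// A -files style check fails, exposing the problem.
	fmt.Println("file ok:", hashOf(assembled) == fileHash)
}
```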