Expose hash over ChunkListEntry to `borg list --format {chunk_ids_$HASH}`
- First time messing with the code -- no test yet; can somebody help write one?
- Existing tests and flake8 give no warnings
- Not sure about the format key naming, suggestions?
Rationale: checksums over file contents are slow, while borg diff is much faster, so I tried to expose in borg list a hash over the same data that borg diff apparently uses to compare items.
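For context, here is a minimal sketch (not the actual patch) of what such a format key could compute: a digest over an item's ordered chunk ID list, with a selectable hash algorithm. The function name is made up, and it assumes chunk IDs are fixed-length byte strings:

```python
import hashlib

def hash_chunk_ids(chunk_ids, algo="sha256"):
    """Hash an item's ordered list of chunk IDs (fixed-length byte strings).

    Fixed-length IDs make plain concatenation unambiguous, so equal digests
    imply equal ID lists (barring hash collisions).
    """
    h = hashlib.new(algo)
    for cid in chunk_ids:
        h.update(cid)
    return h.hexdigest()

ids = [b"\x01" * 32, b"\x02" * 32]
print(hash_chunk_ids(ids))            # stable for the same ordered list
print(hash_chunk_ids(ids, "sha512"))  # a different algo gives a different digest
```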
Codecov Report
:x: Patch coverage is 44.44444% with 10 lines in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 83.86%. Comparing base (08a12cc) to head (d7b3620).
:warning: Report is 3524 commits behind head on master.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/borg/helpers/parseformat.py | 44.44% | 10 Missing :warning: |
Additional details and impacted files
```diff
@@            Coverage Diff             @@
##           master    #5167      +/-   ##
==========================================
- Coverage   83.98%   83.86%   -0.12%
==========================================
  Files          37       37
  Lines        9839     9854      +15
  Branches     1638     1640       +2
==========================================
+ Hits         8263     8264       +1
- Misses       1097     1110      +13
- Partials      479      480       +1
```
Yeah, I see the use case, but I am somewhat concerned that a lot of people might use it incorrectly, draw wrong conclusions, and file support tickets if the outcome is not what they expected...
Good point. I will try to add a hashsum of the chunker parameters as well, and add a note in the format key documentation that the comparison is only valid if those are equal.
OTOH, chunker params usually do not change that often, and false positives (different files showing the same hash) are unlikely.
Is there anything here that could cause false positives besides the usual hash algo collisions? Are those more likely here?
but false negatives could happen (same files showing different hashes due to different chunker params - or even a different hash algo, as you let people choose it).
Hmm, why can the choice of hash algo influence false negatives?
One more question: this is doing the correct thing, isn't it? If chunker parameters are equal, equal checksums of the chunk ids would mean content of item/file is identical?
Can you maybe write a test for this, or point me to a test that I can copy&modify that tests similar functionality?
1. I don't think a hash over the chunker params is useful. These are only a few values, and looking at them "as is" is simpler than looking at a hash of them.
2. I don't think so.
3. If you first create a list with hash algo X and later accidentally create a list with hash algo Y, the hashes for the same file will be different. If the hash bit length is different, the reason might be pretty obvious, but if only the algorithm is different and the length is the same, it is not that obvious.
4. Yes, I think it is correct. The borg diff code has `identical = (item1.chunks == item2.chunks)` if `can_compare_chunk_ids` is True. And comparing hashes of lists is about as good as comparing the complete lists, except for the minimal risk of a hash collision.
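To illustrate that equivalence with a sketch (made-up IDs, not borg code): comparing digests of the ordered chunk ID lists behaves like comparing the lists themselves, barring hash collisions:

```python
import hashlib

def chunks_hash(chunk_ids):
    # chunk_ids: ordered list of fixed-length chunk IDs, so plain
    # concatenation is unambiguous before hashing.
    return hashlib.sha256(b"".join(chunk_ids)).hexdigest()

item1 = [b"\x11" * 32, b"\x22" * 32]
item2 = [b"\x11" * 32, b"\x22" * 32]   # same chunks, same order
item3 = [b"\x22" * 32, b"\x11" * 32]   # same chunks, different order

# mirrors: identical = (item1.chunks == item2.chunks)
assert (item1 == item2) == (chunks_hash(item1) == chunks_hash(item2))
assert (item1 == item3) == (chunks_hash(item1) == chunks_hash(item3))
```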
> but false negatives could happen (same files showing different hash due to different chunker params)
Ideally, one would prefer a hash that can be generated quickly, without having to look at the file content, and that still guarantees the content to be identical (collisions aside); in practice that is not possible.

What is possible are the existing hashes/checksums, which are slow because they read the file content.

Also possible is this shortcut of quickly getting the list of chunks: once it is established (by other means) that the actual file content has been put into these chunks, and (by borg check) that the content of the chunks is unchanged, one can deduce that the file still has the same content, as it is still composed of the same chunks in the same order.

If, however, one rechunks a repo, each chunk is read, the data is processed (encryption, compression) and written to a new, different chunk. Without a separate verification of the file content there is then no guarantee that no bit error happened in that process, so to me it is the desired behaviour for a hash based on the chunk list to be different, because it is a list of different chunks. One can quickly generate new hash lists for the new chunks and, after having somehow verified (still much more slowly) that all the new chunk contents are correct as well, use the new hashes like the old ones.
> 1. I don't think a hash over the chunker params is useful. These are only a few values and looking at them "as is" is simpler than at a hash of them.
Yes, my first thought as well; plus, for people who do change them at some point, seeing which ones (i.e. the current or older ones) were used for that file of the list output is a benefit as well.
> And comparing hashes of lists is about as good as comparing the complete list, except for the minimal risk of a hash collision.
Minus the potential extra "collision": two chunks of different data (due to different chunker params) could end up with the same hash, which for a single-chunk file would be enough. That could be prevented by computing the hash over the chunker params and the chunk IDs, and calling it something like {chunks_fingerprint_*} to reflect that it is something abstract that characterizes the chunks the file consists of.
> 3. If you first create a list with hash algo X and later accidentally create a list with hash algo Y, the hashes for the same file will be different.
The exact same could be said of the various existing hashes of file contents, yet no one (I hope :wink:) would pose this as a valid argument to remove them or substitute them with a single one, so I do not see it as a valid argument here either.

With a fixed algorithm, should it at some point be deemed not good enough anymore, or worse, outright broken, there needs to be a transition, and people need to wait for a newer borg with the new hash to trickle down to the next stable release of their distro of choice, etc.

With the ability to choose any of the supported hash algorithms, people can generate lists using one or multiple hashes, either now (to be prepared for one of them going "bad") or right at the point where the hash they use becomes publicly known to be broken. If they are more concerned about data corruption by bad hardware than about advanced attackers, they can also still use the old hash to verify integrity against old hash lists and subsequently redo them with new hashes, or whatever fits their use case.
I meanwhile have the functionality I was after, i.e. the list of actual chunks (not a hash over it), implemented in shell using borg debug and jq:

```shell
borg debug dump-archive REPO::ARCHIVE - \
    | jq '._items[] | if .chunks != null then "\(.path) \(.chunks | map(.[0] | ltrimstr("\u007f")) | join(","))" else "\(.path) -" end' \
    | sed --expression='s/^"//;s/"$//'
```

Not sure yet why borg debug inserts that Unicode DELETE character in front of each chunk ID, but the above command produces the same output as my own (hackish) patch adding a chunkids format key to list did. :smiley:
See https://github.com/borgbackup/borg/blob/1.1-maint/src/borg/helpers.py#L1240 for the comment about the special character.
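For reference, a rough Python equivalent of the jq pipeline above, assuming the dump-archive JSON layout the jq filter implies (an `_items` array where each item has `path` and optionally `chunks`, with the chunk ID as the first element of each chunk entry, prefixed by the U+007F marker character). The function name and the demo data are made up:

```python
import json

def list_chunk_ids(dump_text):
    # dump_text: JSON as produced by `borg debug dump-archive REPO::ARCHIVE -`
    archive = json.loads(dump_text)
    lines = []
    for item in archive["_items"]:
        chunks = item.get("chunks")
        if chunks:
            # chunk entry: [id, size, ...]; strip the leading U+007F marker
            ids = ",".join(c[0].lstrip("\u007f") for c in chunks)
        else:
            ids = "-"
        lines.append(f"{item['path']} {ids}")
    return lines

demo = '{"_items": [{"path": "a.txt", "chunks": [["\\u007fdeadbeef", 10, 5]]}, {"path": "dir"}]}'
print("\n".join(list_chunk_ids(demo)))
# prints:
# a.txt deadbeef
# dir -
```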
Due to @solrize asking about it on IRC, I have thought about such a "full file hash" (computed over the chunkids list) again.
I guess, as it is now, the code here can only be used to compare files in the same repo, IF the chunker_params are the same.
With borg2, we'll get the concept of "related repositories" that share the same chunker secret and the same id-hash secret, thus (if chunker params are the same also), files will get chunked exactly the same way and the chunk ids will also be computed the same way. So we could quickly determine identical files in different related repos if these conditions are met.
```
conditions_fingerprint = H(chunker_secret, chunker_params, id_hash_secret, id_hash_algo)
chunks_fingerprint = H(chunks_id_list)
fingerprint = conditions_fingerprint + "-" + chunks_fingerprint
```
Note:
- if the left part (conditions fp) is already different, there is no point in comparing the fingerprints.
- if the left part is identical and the right part (chunks fingerprint) is different, the file content is different.
- if the left part is identical and the right part is identical, the file content is identical (hash collisions aside).
H could be one specific cryptographic hash. Considering that sha256 is hardware-accelerated on modern CPUs and the data isn't that much anyway, we could just take that.
Also: we could store the fingerprint in item metadata at borg create time, if we like. Maybe not worth it though, as it should be rather quick to compute at any time.
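The scheme above could be sketched like this in Python; the function and parameter names and the example values are hypothetical, a real implementation would length-prefix the hashed parts and use borg's actual secrets:

```python
import hashlib

def H(*parts):
    # Simplified: a real implementation should length-prefix each part
    # to keep the concatenation unambiguous.
    h = hashlib.sha256()
    for p in parts:
        h.update(p)
    return h.hexdigest()

def fingerprint(chunker_secret, chunker_params, id_hash_secret, id_hash_algo, chunk_ids):
    conditions_fp = H(chunker_secret, repr(chunker_params).encode(),
                      id_hash_secret, id_hash_algo.encode())
    chunks_fp = H(*chunk_ids)
    return conditions_fp + "-" + chunks_fp

def compare(fp1, fp2):
    cond1, chunks1 = fp1.split("-")
    cond2, chunks2 = fp2.split("-")
    if cond1 != cond2:
        return None            # conditions differ: not comparable
    return chunks1 == chunks2  # same conditions: equal means identical content

fp_a = fingerprint(b"cs", (19, 23, 21, 4095), b"is", "sha256", [b"\x01" * 32])
fp_b = fingerprint(b"cs", (19, 23, 21, 4095), b"is", "sha256", [b"\x02" * 32])
assert compare(fp_a, fp_b) is False  # comparable, content differs
```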