Deduplicated size of a given set of archives?
Have you checked borgbackup docs, FAQ, and open Github issues?
Yes, in particular https://borgbackup.readthedocs.io/en/stable/usage/info.html
Is this a BUG / ISSUE report or a QUESTION?
A question (or feature request).
System information. For client/server mode post info for both machines.
Your borg version (borg -V).
borg 1.1.15
Operating system (distribution) and version.
Linux (Debian sid).
Describe the problem you're observing.
Let's create two directories to backup, each containing a duplicated file (respectively 10M and 100M):
$ mkdir userA userB
$ dd if=/dev/urandom of=userA/fileA bs=1M count=10
10+0 records in
10+0 records out
10485760 bytes (10 MB, 10 MiB) copied, 0.257317 s, 40.8 MB/s
$ dd if=/dev/urandom of=userB/fileB bs=1M count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB, 100 MiB) copied, 2.54954 s, 41.1 MB/s
$ cp userA/fileA userA/fileA2
$ cp userB/fileB userB/fileB2
$ tree userA userB
userA
├── fileA
└── fileA2
userB
├── fileB
└── fileB2
0 directories, 4 files
Then, create an archive for userA and another for userB:
$ export BORG_REPO=borg
$ borg init -e none
$ borg create -Cnone ::userA-monday userA
$ borg create -Cnone ::userB-monday userB
The deduplicated size is (as expected) half the original size, for each user:
$ borg info ::userA-monday | grep -A2 Dedup
Original size Compressed size Deduplicated size
This archive: 20.97 MB 20.97 MB 10.49 MB
All archives: 230.69 MB 230.69 MB 115.35 MB
$ borg info ::userB-monday | grep -A2 Dedup
Original size Compressed size Deduplicated size
This archive: 209.72 MB 209.72 MB 104.86 MB
All archives: 230.69 MB 230.69 MB 115.35 MB
Now, let's create a new archive for each user (without any changes):
$ borg create -Cnone ::userA-tuesday userA
$ borg create -Cnone ::userB-tuesday userB
Now, the deduplicated size is basically 0 for all archives (if we remove exactly one archive, it won't save any space):
$ borg info ::userA-monday | grep -A2 Dedup
Original size Compressed size Deduplicated size
This archive: 20.97 MB 20.97 MB 433 B
All archives: 461.39 MB 461.39 MB 115.35 MB
$ borg info ::userA-tuesday | grep -A2 Dedup
Original size Compressed size Deduplicated size
This archive: 20.97 MB 20.97 MB 435 B
All archives: 461.39 MB 461.39 MB 115.35 MB
$ borg info ::userB-monday | grep -A2 Dedup
Original size Compressed size Deduplicated size
This archive: 209.72 MB 209.72 MB 433 B
All archives: 461.39 MB 461.39 MB 115.35 MB
$ borg info ::userB-tuesday | grep -A2 Dedup
Original size Compressed size Deduplicated size
This archive: 209.72 MB 209.72 MB 435 B
All archives: 461.39 MB 461.39 MB 115.35 MB
But in practice, we may still be interested in knowing the deduplicated size of all archives of userA. Is it possible to get this information easily?
Currently, it seems we can retrieve the global deduplicated size ("All archives") or the deduplicated size of a single archive, but not that of a set of archives (typically all archives with a given prefix). As a consequence, as soon as a new archive is created with few changes, the per-archive deduplicated size becomes meaningless in practice.
The filtering options of borg info just list the individual archives matching the filter; they do not report the deduplicated size of the resulting set of archives:
$ borg info --prefix userA- | grep -A2 Dedup
Original size Compressed size Deduplicated size
This archive: 20.97 MB 20.97 MB 433 B
All archives: 461.39 MB 461.39 MB 115.35 MB
--
Original size Compressed size Deduplicated size
This archive: 20.97 MB 20.97 MB 435 B
All archives: 461.39 MB 461.39 MB 115.35 MB
Yeah, for an archive's deduplicated size, borg always computes what this archive adds compared to the rest of the repo: the size of all unique chunks (chunks ONLY used by this archive).
Related: #71
Related: #8514 (hashtable flags)
Guess we can implement this now:
- allocate flag bit C (== defining set C members) for the considered set of chunks (the union of all chunks referenced by the considered archives).
- allocate flag bit R (== defining set R members) for the rest set of chunks (the union of all chunks referenced by the rest of the archives).
- the unique size of the considered archives is then sum(size(x) for x in C - R), i.e. the sum of the sizes of all chunks with the C bit set and the R bit not set.
Memory needs are not an issue (basically zero additional).
But runtime is O(archives_count * chunks_referenced_per_archive), and this runtime would be needed per combination of considered/rest archives.
So I guess this might be similar to a borg compact run, runtime-wise.
Is it worth implementing this?
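A minimal sketch of that two-flag-bits idea (not borg's actual internals: the flag bits are modelled as plain Python sets, and archive_chunks / chunk_size are hypothetical stand-ins for the chunk index):

def unique_size(considered, rest, archive_chunks, chunk_size):
    # C / R: stand-ins for the two flag bits, here just sets of chunk ids
    C = set()
    R = set()
    for name in considered:
        C.update(archive_chunks(name))   # chunks referenced by the considered archives
    for name in rest:
        R.update(archive_chunks(name))   # chunks referenced by the rest of the archives
    # sum the sizes of chunks with the C bit set and the R bit not set
    return sum(chunk_size(cid) for cid in C - R)

The existing per-archive "deduplicated size" is then just the special case where considered contains a single archive and rest contains all the others.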
Or, with more memory:
- allocate one bit per archive, set the bit for the chunks referenced by that archive
- with that, the unique size of any combination of considered archives / rest of archives can be determined by looking at this bitvector, which only needs to be built once.
- the unique size is sum(size(x) for x in chunks if any(C(n) bit set) and not any(R(m) bit set))
Memory needs: an additional archives_count / 8 * chunks_count bytes.
Runtime is O(archives_count * chunks_referenced_per_archive), but only once; after that (as long as the chunk index is in memory), any combination can be queried quickly.
The problem in general is the amount of queries needed: if my maths aren't wrong, one would need 2^archives_count queries against that index to check all combinations of selected archives vs. remaining archives.
If we were only interested in consecutive archives, the number of queries would be reduced to O(archives_count^2). This could be further reduced by limiting the length of such a consecutive-archives "run", which makes sense especially if the archive count is rather high (and the archives all come from the same backup source data set).
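A sketch of this one-bit-per-archive variant (again with hypothetical helpers, not borg's real API): build a per-chunk reference bitmask once, then any considered/rest split is a single pass over the index.

def build_refmasks(archives, archive_chunks):
    # chunk_id -> bitmask of referencing archives (bit n set == archives[n] references it)
    refmask = {}
    for n, name in enumerate(archives):
        for cid in archive_chunks(name):
            refmask[cid] = refmask.get(cid, 0) | (1 << n)
    return refmask

def unique_size(refmask, chunk_size, considered_mask):
    # a chunk counts iff some C(n) bit is set and no R(m) bit is set
    return sum(chunk_size(cid) for cid, mask in refmask.items()
               if mask & considered_mask and not mask & ~considered_mask)

For example, with archives [monday, tuesday, wednesday], considered_mask = 0b011 would measure the unique size of the first two; with the consecutive-runs restriction above, only O(archives_count^2) masks ever need to be queried instead of 2^archives_count.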
allocate one bit per archive, set the bit for the chunks referenced by that archive
That makes sense.
The problem in general is the amount of queries needed: if my maths aren't wrong, one would need 2^archives_count queries against that index to check all combinations of selected archives vs. remaining archives.
But typically we don't want to check all combinations of selected archives, just measure the "marginal/deduplicated size" of a given subset of archives (or I didn't understand what you said).
@rom1v If you have fixed sets R and C, yes, you only need to query the index once.
Building the index needs to iterate over all archives though (and some people have many archives), so that does not scale very well (similar to the archives part of borg check).
If the sets are fixed and only that one query is intended, the 2-bit approach is better; the n-bit approach would just consume more memory and give no advantage.