Request: analyze which files/directories are using the most storage
Have you checked borgbackup docs, FAQ, and open GitHub issues?
Yes.
Is this a BUG / ISSUE report or a QUESTION?
Feature request.
System information
N/A
Feature request
I think it would be useful if Borg could generate a list showing which files and directories have been using the most storage space (after compression and deduplication) in a repo within a certain time period (such as the last month). This would be helpful for finding directories that are wasting space in the repo and that the user may have forgotten to exclude.
My inspiration for this is the `git-filter-repo --analyze` option, which creates a report of which files in a Git repo have used the most space throughout the repo's history. A `borg analyze` command could produce something similar.
Example `git-filter-repo` analysis for the Borg repo
=== All directories by reverse size ===
Format: unpacked size, packed size, date deleted, directory name
744008017 26787700 <present> <toplevel>
517286397 12653532 <present> src/borg
517286397 12653532 <present> src
104691374 10788204 <present> docs
11934188 5279538 <present> docs/internals
16803244 2387797 <present> src/borg/algorithms
115776607 2116171 <present> src/borg/testsuite
13993693 1921949 <present> src/borg/algorithms/zstd/lib
13993693 1921949 <present> src/borg/algorithms/zstd
80398937 1552724 <present> borg
2621210 1193969 <present> docs/misc
11735486 893532 <present> docs/man
9106300 720133 <present> docs/usage
3752583 686089 <present> src/borg/algorithms/zstd/lib/compress
5965792 479344 <present> src/borg/algorithms/zstd/lib/legacy
15161225 435674 <present> attic
1631753 358329 <present> docs/misc/asciinema
17886036 333528 <present> borg/testsuite
7417804 322251 <present> src/borg/helpers
4493375 267651 2013-07-09 darc
1256006 246006 <present> src/borg/algorithms/zstd/lib/common
1535426 240287 <present> src/borg/algorithms/xxh64
9980617 239345 <present> src/borg/archiver
10038085 212145 <present> src/borg/testsuite/archiver
1234524 204819 <present> src/borg/algorithms/zstd/lib/decompress
7281292 200862 <present> src/borg/crypto
2799285 163539 <present> scripts
747157 155625 <present> src/borg/algorithms/lz4/lib
747157 155625 <present> src/borg/algorithms/lz4
2624369 141717 <present> scripts/shell_completions
862668 125015 <present> src/borg/algorithms/zstd/lib/dictBuilder
967385 109191 2019-05-13 src/borg/_msgpack
987519 103475 2010-10-27 dedupestore
1502405 93505 <present> src/borg/platform
1524670 74299 <present> scripts/shell_completions/zsh
2475851 59442 <present> attic/testsuite
596814 46152 <present> docs/deployment
407120 39587 <present> docs/usage/general
874383 39206 <present> scripts/shell_completions/fish
225316 28212 <present> scripts/shell_completions/bash
92775 27211 <present> docs/_static
189831 26596 2010-03-01 dedupstore
414837 25466 <present> .github
410788 24095 <present> .github/workflows
142992 23397 2017-05-02 src/borg/_crc32
69599 19679 2021-01-28 src/borg/algorithms/blake2
90056 19502 2016-01-24 borg/support
354626 18382 <present> src/borg/cache_sync
77575 16404 <present> src/borg/algorithms/zstd/lib/deprecated
84075 15074 2020-12-21 .travis
222060 12566 <present> src/borg/algorithms/msgpack
41389 11046 2021-01-28 src/borg/algorithms/blake2/ref
28210 8642 <present> src/borg/blake2
17281 8199 <present> requirements.d
70103 8080 <present> docs/borg_theme/css
70103 8080 <present> docs/borg_theme
65232 6249 2015-10-12 docs/_themes
40066 5064 <present> deployment/windows
40066 5064 <present> deployment
104163 4258 2013-07-09 darc/testsuite
11638 3982 <present> docs/3rd_party
53338 3683 2015-10-12 docs/_themes/local
7968 3133 2022-02-27 docs/3rd_party/blake2
11894 2566 2015-05-13 docs/_themes/attic
45939 2393 2015-10-12 docs/_themes/local/static
9171 1553 2015-05-13 docs/_themes/attic/static
2012 765 <present> scripts/fuzz-cache-sync
1608 735 <present> scripts/make-testdata
3235 661 2010-10-31 doc
1032 608 <present> docs/_templates
1530 451 2022-02-26 docs/3rd_party/zstd
328 269 2013-06-24 fake_pyrex
231 177 2013-06-24 fake_pyrex/Pyrex
204 142 2013-06-24 fake_pyrex/Pyrex/Distutils
266 124 <present> scripts/fuzz-cache-sync/testcase_dir
614 117 <present> docs/3rd_party/msgpack
1311 110 2022-02-26 docs/3rd_party/lz4
It would also be interesting to see a feature that does something similar for "time spent backing up" instead of storage used, although I don't know if that would be feasible.
Borg does not yet have such a feature, but I guess it would be possible to implement the space-usage analysis.
It is not possible to analyse the time spent backing up a specific file or directory, though: we only have the overall backup time for an archive, no finer-grained timing data.
Implementation notes:
- "within a certain time period": borg already has CLI options to select archives (like `-a`, `--last N`, etc.); these can be reused/extended.
- this operation is relatively expensive: O(N_archives_considered * archive_size).
- due to deduplication, doing a meaningful space-usage analysis is not trivial; make sure the implementation actually makes sense / is useful (a rough attribution sketch follows below).
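For illustration only, here is a minimal sketch of one possible dedup-aware attribution scheme. It assumes some way to iterate over `(path, chunk_id, stored_size)` records for the selected archives; no existing borg CLI option or API is implied here. Each unique chunk is counted once, and its stored size is credited to every ancestor directory of the first path that references it, which is exactly the kind of arbitrary choice that makes this analysis non-trivial.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical record: (path inside the archive, chunk id, stored/deduped size).
# How to obtain these records from a borg repo is left open.
Record = Tuple[str, bytes, int]

def dedup_dir_sizes(records: Iterable[Record]) -> Dict[str, int]:
    """Credit each unique chunk's stored size to the ancestors of the first
    path that references it, so chunks shared between files or archives are
    only counted once."""
    seen_chunks = set()
    dir_sizes: Dict[str, int] = defaultdict(int)
    for path, chunk_id, stored_size in records:
        if chunk_id in seen_chunks:
            continue  # chunk already attributed to an earlier path
        seen_chunks.add(chunk_id)
        parts = path.split("/")
        for i in range(1, len(parts) + 1):
            dir_sizes["/".join(parts[:i])] += stored_size
    return dir_sizes

if __name__ == "__main__":
    # Made-up example records: two files share one chunk.
    records = [
        ("docs/a.txt", b"c1", 100),
        ("docs/b.txt", b"c1", 100),  # duplicate chunk, not counted again
        ("src/x.py",   b"c2", 250),
    ]
    for path, size in sorted(dedup_dir_sizes(records).items(),
                             key=lambda kv: kv[1], reverse=True):
        print(f"{size:>8}  {path}")
```

Whether "first path seen wins" is the right rule for shared chunks (as opposed to splitting the size between all referencing paths) is one of the design questions an actual implementation would have to answer.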
Since it sounds like individual file timing isn't implemented, I made a quick Python script that ranks directories by their backup time, in case anyone else finds it useful. It requires a timestamped backup log, which can be generated with `borg create --list ... | ts -s "%.s" | tee borg_log.txt` (`ts` is from moreutils).
from collections import defaultdict

# Accumulated backup time per path prefix (files and all ancestor directories).
path_backup_times = defaultdict(float)

with open("borg_log.txt", "r") as file:
    previous_timestamp = 0
    for line in file:
        # Each log line is expected to look like "<seconds since start> <item flag> <path>",
        # with the leading timestamp added by `ts -s "%.s"`.
        parts = line.split()
        if len(parts) >= 3:
            timestamp = float(parts[0])
            file_flag = parts[1]
            file_path = " ".join(parts[2:])
            # See https://borgbackup.readthedocs.io/en/latest/usage/create.html#item-flags
            if file_flag in ["A", "M", "U", "C", "E"]:
                # Time spent on this item = delta to the previous log line.
                backup_time = timestamp - previous_timestamp
                # Credit that time to the file and every ancestor directory.
                path_components = file_path.split("/")
                for i in range(1, len(path_components) + 1):
                    component = "/".join(path_components[:i])
                    path_backup_times[component] += backup_time
            previous_timestamp = timestamp

# Print the 20 paths with the largest accumulated backup time.
sorted_paths = sorted(path_backup_times.items(), key=lambda x: x[1], reverse=True)[:20]
for rank, (path, backup_time) in enumerate(sorted_paths, start=1):
    print(f"{rank}. {path} ({round(backup_time)}s)")
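With a log produced as above, the script prints a ranking in this shape (paths and timings below are purely illustrative, not real measurements):

```
1. home/user (812s)
2. home/user/photos (503s)
3. home/user/photos/2023 (301s)
```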