borg2: it's coming!
update: as there was no negative feedback from alpha testing, borg2 branch was merged into master, thus that big change in form of a major / breaking borg 2.0 release is coming.
read below about what's planned and what's already done.
what could be done if we decide to make a breaking release (2.0) that:
- does not try to be compatible with old repos
- does not try to be 100% compatible with old cli syntax (but 90%)
- only uses new repos / keys it created itself
- only gets old archives via borg import-tar or borg transfer
putting all the breaking stuff into 1 release is good for users (1 time effort), but will take quite some time to test and release.
After borg 2.0, we'll make a N+1 release (2.1? 3.0?) that drops all the legacy stuff from the codebase, including the converter for borg < 2.0 repos.
borg 2.0 general comments
DONE: offer a borg transfer command, #6663, that transforms old stuff only to stuff that will still be supported by borg N+1.
N+1 general comments
much of the stuff described here has own tickets, see "breaking" label / add issue links here.
2.0 crypto
- DONE repo-create: do not create old keys (pbkdf2, legacy AES class, encrypt-and-mac)
- DONE repo-create: do not create AES-CTR based repos, only new AEAD ciphers with session keys
- DONE: remove all docs talking about potential nonce reuse, counter management and related
- DONE: remove key algorithm change (pbkdf2<->argon2), just use argon2 for new repos/key
- DONE: nonce management code for aes-ctr, not needed any more with session keys, remove nonces module #7556
- keep old crypto code, we need it to read / decrypt old repos
N+1 crypto
- remove pbkdf2 + pbkdf2/sha256 keys + docs - we have argon2 now
- remove low_level.AES class which is only used for pbkdf2 key encryption
- remove aes-ctr mode
- remove support for super-legacy passphrase-key type (not supported since long)
- we used hmac-sha256 and blake2b as id-hashes in the past, thus we need to keep them because we need an efficient borg transfer (not needing to re-hash)
2.0 repo
- DONE: implement new repository based on borgstore, #8332
- DONE: implement sftp: borgstore backend (a remote backend that does not need borg serve on the remote).
- DONE keep support for reading borg 1.x repos
- DONE read-compatibility with old local repos for borg transfer, and/or
- DONE read-compatibility with old RPC (ssh: repos) for borg transfer (in that case the old repo would be served by an old borg version)
- DONE borg check only checks new repos, no support for old repos
- DONE only generate latest hints format (this is done since long)
- DONE remove detection of attic repos, #6859
- DONE remove free nonce / nonce reservation api
N+1 repo
- remove support for reading borg 1.x repos
- if we reduce MAX_OBJECT_SIZE from ~20MiB to 16.000.000, 24bit are enough for the entry length
- better alignment for segment entries together with the 1 type byte.
- in-memory indexes have 8 free bits without using more memory
- max archive size goes down by 20%!
- do not allow 16MiB objects, we need some room for potential header size increase in future
- nope, we cannot do that: we need an efficient borg transfer, not needing to re-chunk content!
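The arithmetic behind the (rejected) MAX_OBJECT_SIZE idea above can be checked quickly. This is only a sketch: `OLD_MAX` assumes "~20MiB" means exactly 20 MiB, which is an illustration value and not necessarily the real constant.

```python
# Quick check of the numbers in the list above.
# OLD_MAX is an assumed stand-in for the "~20MiB" limit mentioned there.
OLD_MAX = 20 * 1024 * 1024   # 20 MiB (assumption for illustration)
NEW_MAX = 16_000_000         # the proposed, lower limit

# A 24-bit length field can hold values up to 2**24 - 1.
fits_in_24_bits = NEW_MAX < 2**24

# Room left below 16 MiB, e.g. for a potential header size increase.
headroom = 2**24 - NEW_MAX

# Relative reduction of the per-object limit.
reduction = 1 - NEW_MAX / OLD_MAX

print(fits_in_24_bits, headroom, f"{reduction:.1%}")
```

With these assumed values the reduction comes out a bit above the "20%" quoted in the list, so that figure should be read as a rough estimate.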
2.0 indexes / cache
- DONE remove legacy_cleanup function
- DONE no repo index needed anymore, objects are stored separately and directly accessed via their id
- DONE no chunks index persisted/synchronized anymore, existing chunks are queried from the repo.
N+1 indexes / cache
- remove legacy indexes / caches
2.0 msgpack
- DONE keep bigint for reading old stuff, write with Timestamp instead of bigint #2323
- DONE read legacy str/bytes types for reading old stuff, write only more modern msgpack types (for str vs bytes), #968
N+1 msgpack
- drop support for bigint stuff #2323
- drop support for legacy msgpack str/bytes types #968
2.0 archive / item
- DONE always have borg12_meta in newly created archives (includes borg transfer) (note: Archive.save() adds that)
- DONE #6763 we always have the borg12_meta, we don't have "old" archives: remove support cache for old archives
- DONE #6763 drop support for csize in item chunk list #2357
- DONE transfer: clean stoneage attic bug leftovers, remove erroneous b'acl' key from item metadata
- DONE remove stoneage attic bug support in borg check -> valid_item (erroneous b'acl' key)
- DONE upgrade items with chunks to always have precomputed size
- DONE #6763 read: also still support csize in item chunk list #2357 ("support" here means to tolerate 3-tuples and just throw the csize away and make a 2-tuple of it)
- DONE #6763 write: get rid of csize in item chunk list #2357 (borg transfer)
- get rid of hardlink_master and complex code dealing with it, tends to cause issues. #855 #2325
- DONE: borg transfer converts this
- DONE: borg create / extract needs to work the same way with hlid
- DONE: borg recreate
- "DONE": borg import-tar - drop support for hardlinks for now, see ticket ...
- DONE: borg export-tar
- borg import/export-tar BORG mode?
- DONE #2388
- DONE #5607
- resolve dual use of item.source (hardlink and symlink), #2343
- DONE: borg transfer converts this
- DONE: borg create / extract needs to work the same way with hlid
- DONE: borg recreate
- borg import/export-tar BORG mode?
- rename item.source to item.target for symlinks?
- rename item.bsdflags to item.flags or item.fsflags?
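The csize handling from the list above ("tolerate 3-tuples and just throw the csize away and make a 2-tuple of it") could look roughly like this. It is a minimal sketch, not borg's actual code: `normalize_chunk_entry` is a hypothetical helper name, and the tuple layout `(id, size, csize)` is assumed from the description.

```python
def normalize_chunk_entry(entry):
    """Accept (id, size) or legacy (id, size, csize); always return (id, size).

    Hypothetical helper for illustration only.
    """
    if len(entry) == 3:
        chunk_id, size, _csize = entry  # legacy entry: csize is read, then dropped
    else:
        chunk_id, size = entry          # modern entry: already a 2-tuple
    return (chunk_id, size)


# A chunk list mixing a legacy 3-tuple and a modern 2-tuple.
legacy_chunks = [(b"\x01" * 32, 4096, 1234), (b"\x02" * 32, 100)]
chunks = [normalize_chunk_entry(e) for e in legacy_chunks]
```

Reading stays tolerant this way, while anything written afterwards only ever contains 2-tuples.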
N+1 archive / item
- Item.get_size: remove support for items with chunks, but without precomputed size
2.0 or N+1 checksums
- DONE we could even consider removing libdeflate in 2.0. the only major user will be "borg transfer" and that will be a one-time per repo usage.
- DONE remove libdeflate again and use zlib.crc32 from stdlib, PUT2 format only uses crc32 for header data, not much data getting crc'ed
2.0 compression
- DONE borg transfer can remove the dirty type bytes hack for zlib, add cleaner dispatch / new handler
- DONE we need to keep all compression algorithms, so borg transfer does not need to recompress
- DONE #6701
- DONE #6698
N+1 compression
- drop support (dispatching / handler) for the zlib dirty type bytes hack (ZLIB_legacy)
- we need to keep all other compression algorithms, because borg transfer did not recompress
2.0 upgrade
- DONE remove borg upgrade, doing upgrades from attic / old borg (they need to first upgrade to 1.2 and then use borg transfer)
N+1 archiver
- remove unneeded stuff from benchmark cpu
2.0 remote
- DONE dropped support for borg < 1.1.0 (tuple rpc data format, $LOG non-json log format, repo version check) #7603
- DONE do not use 3 channels (stdin,stdout,stderr) for RPC, logging and progress infos. while ssh and similar methods can do that, e.g. a socket has only 2 channels. #7607
- DONE implement socket in addition to ssh to connect to borg serve. #7615
2.0 cli
- DONE separate archive name from repo url (our regexes are way too complex), #948
- DONE drop scp syntax for the repo location #6691
- DONE #6269
- #6756
2.0 locking
- DONE implement borgstore based locking for borgstore based repos.
- DONE stale lock removal of old locks that did not get refreshed.
- DONE stale lock removal of locks of dead processes.
- DONE most commands now use a shared lock, except borg compact and borg check.
y2038 and requiring 64bit
- if we set SUPPORT_32BIT_PLATFORMS = False, the y2038 issue will be solved (AFAIK), but we require a 64bit platform then.
- not sure if we can already do that. a lot of platforms already dropped 32bit support, but for some this is still in the works (e.g. SBC like the raspberry pi).
- otoh, development of borg 2.0 will take a while, so there's a good chance all 32bit platforms are gone when it will be released. and even if not, borg 1.2 will still exist also.
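For reference, the y2038 limit itself is easy to show: a signed 32-bit seconds counter ends in January 2038. This sketch only illustrates the limit; it does not say anything about how borg's SUPPORT_32BIT_PLATFORMS switch is implemented.

```python
from datetime import datetime, timezone

# Largest value a signed 32-bit seconds-since-epoch counter can hold.
INT32_MAX = 2**31 - 1

# The last moment representable with such a counter (UTC).
last_32bit_moment = datetime.fromtimestamp(INT32_MAX, tz=timezone.utc)
print(last_32bit_moment.isoformat())  # 2038-01-19T03:14:07+00:00
```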
stuff that is out of scope
as you see above, there is already a huge scope of what should be done.
to not grow the scope even further, some stuff shall not be done (now):
- no public key cryptography (neither by gpg nor by reinventing gpg for borg)
- no multithreading
I do not mind breaking for the better at all, but some of the outlined details do not qualify for that IMHO.
When it comes to crypto, breakage should not merely replace one algorithm with a limited life span by another one with a limited life span, which means planning for breakage every few years. Instead breakage should be done to end up with a repo format that does support multiple algorithms and easy and feasible changing of keys as well as used algorithms. That could e.g. be by at least temporarily allowing multiple algorithms to be "active" in a repo at the same time.
When it comes to repo format, a breakage should not be the excuse to just dump a bit of code to still support reading PUTs besides PUT2s, but question the format as a whole and try to address issues such as the current limitations of append-only as well as secure multi-client usage, infeasible (with huge repos) compaction. Ideas here would be:
- split into server managed (optionally hard add-only) content chunk "pool" and a per-client (separate crypto) meta-data stores (that could or could not be chunk based).
- consider use of public-key crypto especially in such a split storage model, where clients do not have the secret key that would be needed to read back chunks from the content pool, but only the one for the meta-data store, with the meta-data being encrypted to both that one and one of the borg server for the meta-data, so that it can both use the meta-data for e.g. compaction of the content chunk pool and auditing of client behavior (limit access to content chunks to those the client does have "written" before). The devil is in the details, but again, just ideas.
- evaluate radically different repo formats, e.g. a hierarchy of nested directories plus files named by chunk-id (prefix), leading in the extreme to a segment-less format where chunk existence is a single stat, chunk deletion a single rm of a single file and reading a chunk would simply be reading a single file entirely, or in less extreme cases (should practical testing reveal that filesystems would be a limiting factor) to segments that are not filled one after another, but collect chunks with the same prefix. The extreme case would on the plus side lend itself perfectly to object storage backends, which people keep asking for. (Both these could turn out to be terrible ideas, but they and other wild ideas should be considered and looked into to make sure a completely breaking repo format is one to be kept for a long time and worth the breakage)
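To make the segment-less idea from the last point concrete, here is a minimal sketch of such a layout. Everything in it is hypothetical (function names, two directory levels, two hex digits per level); it only illustrates "one chunk, one file, named by chunk id" and is not a proposal for the actual format.

```python
import os
import tempfile


def chunk_path(base, chunk_id, levels=2):
    """Map a chunk id to base/<aa>/<bb>/<full hex id> (layout is an assumption)."""
    hex_id = chunk_id.hex()
    prefix_dirs = [hex_id[2 * i:2 * i + 2] for i in range(levels)]
    return os.path.join(base, *prefix_dirs, hex_id)


def put_chunk(base, chunk_id, data):
    path = chunk_path(base, chunk_id)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
    os.replace(tmp, path)  # rename into place so readers never see partial files


def chunk_exists(base, chunk_id):
    return os.path.exists(chunk_path(base, chunk_id))  # a single stat


def get_chunk(base, chunk_id):
    with open(chunk_path(base, chunk_id), "rb") as f:
        return f.read()  # reading a chunk is reading one whole file


def delete_chunk(base, chunk_id):
    os.remove(chunk_path(base, chunk_id))  # deleting a chunk is one unlink


with tempfile.TemporaryDirectory() as base:
    cid = bytes(range(32))
    put_chunk(base, cid, b"chunk data")
    found_before = chunk_exists(base, cid)
    data_back = get_chunk(base, cid)
    delete_chunk(base, cid)
    found_after = chunk_exists(base, cid)
```

The same key-to-path mapping would carry over to an object store, where the path simply becomes the object key.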
When it comes to compression, what really should go is the auto mode - or be reimplemented with useful parameters, which IMO are hard to come up with in the light of ZSTD performance.
About "scp syntax": On the one hand I think it does not matter much, any sane setup does have wrapper scripts around it to make you only ever see and use the repo URL once in the life of the repo. On the other hand, its use in scp/rsync etc. makes that non-URL syntax much more common to users, and while the code handling things leaves a lot of room for improvement, a lot of that has nothing to do with the non-URL syntax as such.
Crypto:
AES-CTR does not have a limited timespan. Why we are doing this is to get rid of the fundamental counter management issues:
- you can't lose the local counter memory and not trust (and continue to use) the remote repo.
- also you can't use multiple clients for one repo and not trust the repo.
There's also a slight ugliness of only storing a part of the IV within the old format, but that is just a minor detail.
The new AEAD algorithms with session keys solve that.
We could have all 3 crypto algorithms in parallel in the borg code (but currently not in the same repo), but there are other things on the above list that are best solved with tar-export/import or borg transfer and a new repo, and if one does that anyway, one can as well go for the better crypto in one go (instead of having to do the export/import again some time later).
I don't think it would be a good idea to use different encryption algorithms in the same repo and especially not with the same key - so if we would go for the complexity of supporting repos with that, we would need multiple (master) keys for one repo, making it more complex for borg and also for the users.
You also can't just "change the keys / algos" in the same repo. Due to dedup, a lot of data would still be encrypted by the old key and old algorithm. To really get rid of it you'd need some global migration, touching a lot of data and needing some management for the case of interruptions of that process. That's about as much I/O and time as needed for the export/import, just with much more complexity.
Repository:
It's not just about the "reading PUTs" - it is in quite a few places, including borg check (which is already quite complex).
I can imagine doing some more and even radical changes to the repo format if we re-start with new repos and require export/import anyway. I am not too happy with the complexities of segment file handling either.
In the end this will depend on some developers architecting and implementing it though, and we should try to not make the scope too big or it'll never get releasable.
Repos: interesting ideas. Needs more analysis I guess, esp. since we likely want to keep the transactional behaviour and maybe also the LOG like behaviour.
Segmentless repos: if everybody had a great repo filesystem and enough storage, I guess that could be done (but it would mean that if the source has a million files, the repo could have XX million chunks). Super simple for borg, but a huge load on the repo fs (did that within my zborg experiment back then). Could also be quite a bit slower due to more random accesses and more file opening, and use a lot more space due to fs allocation overheads if one has a significant amount of small files.
Cloud storage: I don't want to maintain such code myself, that's just a rabbit hole I don't want to get into. So, for me it is "local directory" as the repo (plus some method of remoting that, not necessarily the hard to debug current remote.py code).
Compression: auto mode should go? do we have a ticket about that?
@elho thanks for the detailed feedback btw!
This ticket is primarily meant for the to-break-or-not-to-break decision. Once we decide to do a breaking release, requiring new repos, key, export/import, we can do a lot of changes and need to discuss the details in more specific tickets.
We should somehow try to limit the scope though, so it won't take forever.
@ThomasWaldmann if instead of segments something like git's packs could be used, then with the new encryption session stuff it may even become feasible to push packs instead of archives between repos without necessarily requiring de/encryption
This would potentially also enable dumb remotes like s3, sshfs, with the caveat of having more pain with post prune gc and repacking
@RonnyPfannschmidt encrypted chunks can be transferred between related repos using the same key material, there is a ticket about that already. I don't know the git pack format, so not sure how that is relevant for (re-)encrypting. But if we want to transfer a full "pack", there might be requirements due to that (opposed to just transferring a single chunk).
I would be happy with a borg1.3 that on first use of serve on (or direct local access to) a v1 repo would start out (maybe after some confirmation) by iterating over all segments, for each creating a new replacement segment file, filling it with the same content except for using PUT2 whenever a PUT is read from the old one, doing some sort of verify pass making sure the new segment as it arrived on disk has the same data as the old one and only then atomically mv the new over the old one. When having done the last segment file without being interrupted, switch repo version from v1 to v2. No other command or code path would need to support v1 and PUT in that scenario.
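The write-new, verify, then atomically-replace pattern described above could be sketched like this. This is an illustration under stated assumptions only: `rewrite_atomically` is a hypothetical helper, and the `transform` used in the demo (`bytes.upper`) is a placeholder, not a real PUT to PUT2 conversion.

```python
import os
import tempfile


def rewrite_atomically(path, transform):
    """Write transform(old content) to a temp file, verify it, then replace path."""
    with open(path, "rb") as f:
        old = f.read()
    new = transform(old)

    tmp = path + ".new"
    with open(tmp, "wb") as f:
        f.write(new)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to stable storage

    # Verify pass: read the new file back and compare. Note that this may be
    # served from the page cache, so it is weaker than a true on-disk check.
    with open(tmp, "rb") as f:
        if f.read() != new:
            os.remove(tmp)
            raise OSError("verification of rewritten file failed")

    os.replace(tmp, path)  # atomic rename over the old file


with tempfile.TemporaryDirectory() as d:
    segment = os.path.join(d, "segment")
    with open(segment, "wb") as f:
        f.write(b"put put")
    rewrite_atomically(segment, bytes.upper)  # placeholder transform
    with open(segment, "rb") as f:
        result = f.read()
    leftovers = os.listdir(d)
```

If the process is interrupted before the final rename, the old segment is still intact and only a stray `.new` file needs cleaning up.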
Note: I updated the topmost post with feedback from you all (thanks!) and also with new insights. I also edited some other posts to remove duplicate / outdated information to keep this issue short.
Progress in #6663 and #6668 looks quite good.
About version: if we require people to transfer their repos using borg transfer, guess that must be borg 2.0 because you can't just continue with an existing repo as it is.
So, if we merge these, next release from master will not be 1.3, but 2.0.
not sure if we can already do that. a lot of platforms already dropped 32bit support, but for some this is still in the works (e.g. SBC like the raspberry pi).
I think especially SBCs will stay 32bit for a while, because the savings in having a smaller pointer width are relevant on low-memory platforms.
Aren't there clock system calls which return a 64-bit wide integer even on 32-bit ABIs?
Well, it's not just like borg needs to get the 64bit time by doing a call, it rather is the whole system of kernel / libc / python needing to work with timestamps of reasonable length. E.g. timestamps in os.stat output, python time and datetime stuff, etc.
So, if we merge these, next release from master will not be 1.3, but 2.0.
Changing the module name from borg to borg2 at this point is something to be thoroughly considered.
Both, to eventually play with potential (meanwhile obsoleted already) export/import tar magic, but also to be able to test 1.2 in parallel with 1.1 in production across all my systems in a sane manner, I went on the surprisingly painful adventure to create myself a variant of the distribution's package that can be installed and used in parallel with the stock 1.1 one. In a hackish manner, one could install borg below a different path, but that is nothing any distribution would do, I went the painful way to do such a rename in there. (IOW happy to clean that up and even break out some of the cases where absolute imports were used without need and against the common practice in most other similar places in the code).
For the original idea of export-import migration this would be a requirement, here it is not, but in practice, for people backing up to multiple repos, scenarios like migrating the local one to 2.0 while still waiting an undefined time for the borg storage provider the external one resides on to support 2.0 could be very common.
Guess it is not just about the module name, but also the cli cmd name. OTOH, I'd dislike to put the version number into the cli cmd name.
For testing, one could also use the fat binary and rename that to borg2.
Guess it is not just about the module name, but also the cli cmd name. OTOH, I'd dislike to put the version number into the cli cmd name.
It is, but the command name is something that can just be changed without requiring any modification of the command itself to keep it working, and on the other hand is something distributions have support for.
E.g. in Debian, a borg2 package would ship borg2 etc. commands, but (along with a packaging update to the 1.x version to be shipped in parallel) make use of the alternatives system of managed symlinks to have borg commands available to the user that point to whichever version is installed on its own, or to (probably best for compatibility) borg1 if both are installed, with the option for the user to easily switch that (along with the corresponding manpages) according to their preference.
Aware wrappers that consequently have an idea of the configured repo(s) being version 1 or 2 would know to invoke the appropriately versioned command name in all cases.
For testing, one could also use the fat binary and rename that to borg2.
Testing as in "is this for me" or "does this work at all", yes. But not for testing as in "let me run this in parallel to 1.1 for a couple months and see whether any issues arise before ditching 1.1", ie. a point where 1.2 can be regarded to be at currently.
Well, it's not just like borg needs to get the 64bit time by doing a call, it rather is the whole system of kernel / libc / python needing to work with timestamps of reasonable length. E.g. timestamps in os.stat output, python time and datetime stuff, etc.
The statx syscall already has 64-bit wide timestamps (it uses __s64 for the seconds instead of time_t). Since kernel 5.1, 64-bit wide time structs are available on a bunch of other system calls.
So the kernel can (probably; I saw patches for utimes64, not sure if those have been applied, it hasn't been mentioned in that post above) do it.
I'm not sure what the current status is on the glibc side of things (the page looks a bit unclear on progress), but it may be worth pushing python on 32bit architectures to use it if glibc is ready.
All I'm saying: don't drop support for 32-bit architectures, but go for dropping support for 32-bit timestamps, which don't have to be the same thing anymore in this day and age.
Note: I updated the top post with the current progress and also released 2.0.0a3 - if no one is holding me back with negative testing results, I'll soon merge the borg2 branch into master.
DONE remove libdeflate again and use zlib.crc32 from stdlib, PUT2 format only uses crc32 for header data, not much data getting crc'ed
You could eventually use zlib-ng as an alternative to zlib (for the prebuilt binaries), it has optimized CRC32 routines that should be faster than zlib and hopefully similar to libdeflate.
@FabioPedretti the point of using zlib is that python3 provides it, so there is no additional dependency for borg.
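For illustration, this is all that is needed to CRC a small piece of header-like data with the stdlib. The input bytes are just the standard CRC-32 test string, not an actual PUT2 header.

```python
import zlib

# Standard CRC-32 check input; stands in for a small header here.
header = b"123456789"
checksum = zlib.crc32(header)
print(hex(checksum))  # 0xcbf43926

# The CRC can also be computed incrementally by passing the running value.
running = zlib.crc32(b"12345")
running = zlib.crc32(b"6789", running)
```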
as there was no negative feedback from alpha testing, I just merged the borg2 branch into master. 🚀
keeping this issue open until N+1 for the misc. remaining TODO.
Is there any overview of what borg2 improves for me as a user? How usable is it?
IIRC I did not write a short overview yet, so there's what you can read in the change log and in the top post of this ticket.
The super short overview is "we fixed most issues labelled as BREAKING", they often were long-term open issues (sometimes since attic) because fixing them breaks compatibility.
See there: https://github.com/borgbackup/borg/issues?q=label%3Abreaking
2.0.0b1 should be pretty usable, just do not run it against production repos (rather use copies to experiment).
@xeruf see there: https://github.com/borgbackup/borg/issues/6956
What about using zstd dictionaries to get the compression ratio up? :)
@RubenKelevra do you have an idea about how exactly would that work inside borg?
@ThomasWaldmann sure:
- Run a bunch of test files through file to determine the mime type.
- Cut them in pieces with buzhash
- Save the pieces on the disk in folders named like the mime types
- Then let zstd create a dictionary for each mime type.
- Create a table with byte size numbers for each mime type
Rationale behind the last step is: zstd archives can select the used dictionary for decompression by a byte (as a minimum size identifier). Since on the block level it's probably pretty tricky to get the mime type before decompressing the file, it's probably best to let zstd choose the correct dictionary by an identifier stored in each block (takes up one byte).
This becomes important if the same data is found in different types of files. Say a tar archive contains blocks of a JSON file.
The mime type is in this case no longer helpful, but decompression is still possible.
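To illustrate why a pre-shared dictionary helps with small, similar blocks: zstd is not available in the standard library of every Python version, so this sketch substitutes zlib's preset dictionary (`zdict`) as a stand-in. The dictionary and sample contents are made up for the example, and nothing here reflects how borg compresses chunks.

```python
import zlib

# Made-up "trained" dictionary and a small sample that resembles it.
dictionary = b'{"name": "", "type": "file", "size": 0, "mode": "0644"}'
sample = b'{"name": "a.txt", "type": "file", "size": 12, "mode": "0644"}'


def compress(data, zdict=None):
    c = zlib.compressobj(zdict=zdict) if zdict else zlib.compressobj()
    return c.compress(data) + c.flush()


def decompress(data, zdict=None):
    d = zlib.decompressobj(zdict=zdict) if zdict else zlib.decompressobj()
    return d.decompress(data) + d.flush()


plain = compress(sample)                  # no dictionary
with_dict = compress(sample, dictionary)  # same data, preset dictionary

print(len(sample), len(plain), len(with_dict))
```

The decompressor needs the very same dictionary, which is the reason a dictionary identifier has to be stored with each compressed block.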
@RubenKelevra well, I see what you mean, but that is not how "borg create" works.
But maybe check the issue tracker if we have a ticket about this and if not, create a new one, so we can collect ideas there.
@RubenKelevra well, I see what you mean, but that is not how "borg create" works.
Interesting, can you elaborate or point me to the part which is different than I think, so I can take a look? 🤔
But maybe check the issue tracker if we have a ticket about this and if not, create a new one, so we can collect ideas there.
Will do
Hello, I am a new user of borg. I need a usable backup by the end of December. Is the release of Borg 2.0 planned by the end of this year? If so, I could start with it.