Borg backup to Amazon S3 on FUSE?
Hi everyone,
I'm interested in using Borg to back up my webserver to an Amazon S3 bucket. I've been using Duplicity, but I'm sick of the full/incremental model, as well as the difficulty of pruning backups. I love the ease of use and features that Borg provides, but I don't really understand the internals and I'm not sure whether it will work with Amazon S3 storage.
Specifically, I'm considering mounting my S3 bucket over FUSE, using one of the following three options:
- https://github.com/s3fs-fuse/s3fs-fuse/wiki/Fuse-Over-Amazon
- https://github.com/archiecobbs/s3backer
- https://github.com/danilop/yas3fs/wiki
Any comments on which, if any, would be more appropriate? And how tolerant would Borg be of S3's "eventual consistency" weirdness?
Additionally, I need to plan against the worst-case scenario of a hacker getting root access to my server and deleting the backups on S3 using the credentials stored on my server. To guard against this, I was thinking about enabling S3 versioning on the bucket, so that files deleted with my server's S3 user account can still be recovered via my main Amazon account. Then I would configure S3 lifecycle management to delete all versions of deleted files after X amount of time. In this case,
- How much of my S3 data would Borg routinely need to download in order to figure out which files have changed and need to be backed up? (I'm worried about bandwidth costs.)
- How much accumulated clutter and wasted space could I expect from files that Borg "deletes" (which will actually be retained on S3 due to the versioning)?
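For reference, a minimal sketch of the versioning + lifecycle setup described above, using boto3; the bucket name and the 90-day retention window are placeholders, so double-check the rule shape against the current S3 docs:

```python
# Hedged sketch (not borg-related code): enable versioning on the bucket and
# automatically expire old, non-current object versions after a retention
# window. Bucket name and the 90-day window are placeholders.
import boto3

s3 = boto3.client("s3")
bucket = "my-borg-backups"  # placeholder

# Keep old versions of overwritten/deleted objects around...
s3.put_bucket_versioning(
    Bucket=bucket,
    VersioningConfiguration={"Status": "Enabled"},
)

# ...but let S3 clean them up automatically after X days.
s3.put_bucket_lifecycle_configuration(
    Bucket=bucket,
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-old-versions",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # apply to the whole bucket
                "NoncurrentVersionExpiration": {"NoncurrentDays": 90},
            }
        ]
    },
)
```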
Again, my concerns are based on me not really understanding all the black magic that happens with all the chunks and indexes inside a Borg repository, and how much they change from one backup to the next.
Thanks in advance for the help!
:moneybag: there is a bounty for this
I'm still trying to get an idea of what exactly happens in the Borg repo from one run to the next. I used it to back up my ~/ directory (about 72GB on disk) last night, and I messed around with creating and deleting files and re-combining ISO images to see how well the de-dupe works. (It works extremely well, I might add!) I ran around 30 backups with no pruning. That was last night; today I used my computer for some web browsing and then ran another backup, with a before-and-after `ls -sl` on the `repo/data/1` directory. Here's a diff of `repo/data/1` before and after:
http://paste.ubuntu.com/11910814/
(1 chunk deleted, 4 added, total change of 5)
Then I pruned all but the most recent backup and ran another diff:
http://paste.ubuntu.com/11910824/
And here's the `repo/data/0` directory, just the names of the deleted files:
http://paste.ubuntu.com/11910839/
(580 chunks deleted, 75 added, total change of 655)
So assuming that all the chunks are around 5MB, that would be around 3GB of deleted data taking up wasted space in Amazon S3, which would cost me about $0.05/month in Glacier according to Amazon's calculator (and it would have to stay there for 90 days to avoid a penalty), or something like $0.11/month in regular S3 storage. Additionally, for the webserver I actually want to back up with this scheme, there would be far fewer changes and much less total data stored.
So I would tentatively think this could be a good option?
I might add that you can get 10 TB (that's ten terabytes) of "nearly" OpenStack Swift compatible storage from HubiC.com for 50 Euro a year (no kidding). I use this together with my Hubic Swift Gateway and the Swift duplicity backend.
This is also EU storage (located in France), which solves some problems with German laws.
I also think that it would be fairly easy to implement as a backend for software with a chunked approach.
P.S.: Their desktop client (still) sucks imho... but you even get 25 GB for free, which can also be used for experiments with the API.
Thanks @oderwat for the tip! Good to know.
I must say that I don't use "cloud data storage services", so I can't advise about their API/capabilities.
Borg's backend is similar to a key/value store, and segment files only get created/written, never modified (apart from complete segment files being deleted), so it could be possible if someone writes such a backend.
Borg has an "internals" doc that might be interesting for anybody wanting to write such a backend. If information is missing there, please file a docs issue here.
borg has some level of abstraction for remote repositories... there's currently only a single `RemoteRepository` implementation, and it hardcodes ssh in a bunch of places. we nevertheless have a list of methods we use in RPC calls that would need to be defined more clearly, maybe cleaned up, and then implemented in such a new implementation:
```python
rpc_methods = (
    '__len__',
    'check',
    'commit',
    'delete',
    'destroy',
    'get',
    'list',
    'negotiate',
    'open',
    'put',
    'repair',
    'rollback',
    'save_key',
    'load_key',
)
```
this list is from `remote.py`, and is passed through the SSH pipe during communication with the `borg serve` command...
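to make this a bit more concrete, here's a rough, hypothetical sketch (names invented, not borg code) of how a "dumb" subset of those methods could map onto the native S3 API via boto3 - the actual hard parts (commit/rollback, check/repair, locking, eventual consistency) are deliberately left out:

```python
# Hypothetical sketch only: an S3-backed key/value store covering a subset of
# the RPC methods above (put/get/delete/list/__len__). It ignores
# transactions, check/repair, locking and eventual consistency, which are the
# real problems discussed below.
import boto3

class S3KeyValueStore:
    def __init__(self, bucket, prefix="data/"):
        self.s3 = boto3.client("s3")
        self.bucket = bucket
        self.prefix = prefix

    def _key(self, id_hex):
        return self.prefix + id_hex

    def put(self, id_hex, data):
        self.s3.put_object(Bucket=self.bucket, Key=self._key(id_hex), Body=data)

    def get(self, id_hex):
        resp = self.s3.get_object(Bucket=self.bucket, Key=self._key(id_hex))
        return resp["Body"].read()

    def delete(self, id_hex):
        self.s3.delete_object(Bucket=self.bucket, Key=self._key(id_hex))

    def list(self):
        paginator = self.s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=self.bucket, Prefix=self.prefix):
            for obj in page.get("Contents", []):
                yield obj["Key"][len(self.prefix):]

    def __len__(self):
        return sum(1 for _ in self.list())
```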
notice the similar issue in https://github.com/jborg/attic/issues/136
Supporting storage services like AWS S3 would be huge and would make borg a real alternative to tools like tarsnap. I would support a bounty for a) a generic storage interface layer and b) S3 support based on it. I suggest libcloud https://libcloud.readthedocs.org/en/latest/storage/supported_providers.html to design the interfaces / deal with cloud storage services.
Another interesting backend storage might be sftp/scp, as provided by some traditional hosting providers, like Hetzner or Strato HiDrive
@rmoriz your contribution would of course be welcome. bounties are organised on bountysource, in this case: https://www.bountysource.com/issues/24578298-borg-backup-to-amazon-s3-on-fuse
the main problem with S3 and other cloud providers is that we can't run native code on the other side, which we currently expect for remote server support. our remote server support involves calling fairly high-level functions like `check` on the remote side, which can't possibly be implemented directly in the native S3 API: we'd need to treat those as different remotes. see also https://github.com/borgbackup/borg/issues/191#issuecomment-145749312 about this...
the assumptions we make about the remotes also imply that the current good performance we get on SSH-based remotes would be affected by "dumb" remotes like key/value or object storage. see also https://github.com/borgbackup/borg/issues/36#issuecomment-145918610 for this.
Please correct me if I'm wrong.
It looks like we have/need a three-tier architecture:
- borg client
- borg server (via ssh)
- (dumb) storage.
So the borg server part needs a storage abstraction model where backends like S3, ftps, Google Cloud Storage, etc. can be added.
Is that correct? I think using FUSE adapters is not a reliable approach (IMHO).
Update:
- bounty added: https://www.bountysource.com/issues/24578298-borg-backup-to-amazon-s3-on-fuse
- @RonnyPfannschmidt that solution would be even better. I would love to see this happen.
- please discuss possible implementations with the maintainers before starting work. Thank you.
the server is not necessarily needed
borg's internal structure would allow using something like a different k/v store as well - but someone needs to implement and test it
Thanks for putting a bounty on this.
If someone wants to take it: please discuss the implementation here beforehand, do not work in the dark.
+1 from me on this. I want exactly what the original poster is talking about. Also, since I'm relying on deduplication, I want to use some really highly durable storage like Amazon has. The versioning lifecycles to protect against the "compromised host" problem would be fantastic too... (I added to the bounty :) )
I've written up some of my thoughts on some of the limitations of S3, and a WIP discussion of some possible methods to address them. It is organised as a single document right now, but as it fleshes out, I will expand it as appropriate. Please comment there and I will try to keep the document up to date with as much information as possible. See https://gist.github.com/asteadman/bd79833a325df0776810
Any feedback is appreciated. Thank you.
the problematic points (as you have partly noticed already):
- using 1 file per chunk is not gonna work practically - too many chunks, too much overhead. you have to consider that 1 chunk is not just the usual 64kiB (or soon: 1MiB) target chunk size, but can be way smaller if the input file is smaller. you can't really ignore that; in the end, this is something that has to be solved.
- the archive metadata (list of all files, metadata of files, chunk lists) can be quite large, so you won't be able / you won't want to store this in one piece. borg currently runs this metadata stream through chunker / deduplication also, which is quite nice because we always have the full(!) item list there and a lot of it is not changing usually.
- "skipping chunks that already exist" - if you want to do that quickly, you need an up-to-date (consistent) local index / hash table. otherwise, you may have 1 network roundtrip per chunk.
- that "eventually consistent" S3 property is scary. it's already hard enough to design such a system without that property.
- "chunk staleness" is an interesting idea. but i think you could run into race conditions - e.g. you just decided that this 3 months old chunk shall be killed, when a parallel backup task decided to use it again. guess either atomicity or locking is needed here.
Yes, the target chunk size in 1.0 will be 1 or 2MiB. That doesn't mean that there will be no tiny chunks - if your file only has 1 byte, it will still be 1 chunk. So the average might be lower than the target size.
BTW, it is still unclear to me how you want to work without locking, with parallel operations allowed (including deletion). I also do not think that making this github issue longer and longer with back-and-forth discussion posts is helping here very much - if we want to implement this, we need ONE relatively formal description of how it works (not many pages in discussion mode).
So I'd suggest you rather edit one of your posts and update it as needed until it covers everything needed, or until we find it can't be implemented. Also, the other posts (including mine) should be removed after integration. I am also not sure a gh issue is the best place for that; maybe a github repo, where one can see diffs and history, would be better.
http://www.daemonology.net/blog/2008-12-14-how-tarsnap-uses-aws.html doesn't sound too promising about the possibility of reliably using S3 directly from a backup tool (he wrote a special server that sits between the backup client and S3).
@TW that post was from 2008… https://aws.amazon.com/de/s3/faqs/#How_durable_is_Amazon_S3
@ThomasWaldmann - actually it's promising - it's not too different from what borg is already doing in the local format, and it might not need too much of a change to make borg work against it
Don't forget BackBlaze's B2. Cheapest storage around. Hashbackup already does all of that, but it's closed source, so who knows how it is done.
Amazon Cloud Drive offers unlimited storage for just 50$ a year. Would be great if it'd be supported! :)
There's a FUSE FS for it: https://github.com/yadayada/acd_cli
That should work okayish (maybe not the best performance).
This thread here is about directly using the S3 key-value store as a backup target (no intermediate FS layer), at least that's how I understand it.
I think it's kinda unrealistic, at least for now, to completely redo the Repository layer. An alternative Repository implementation could be possible, but I don't see how you could do reliable locking with only S3 as the IPC, when it explicitly states that all operations are only eventually consistent. Parallel operation might be possible, but really, it's not a good idea for a first impl. Also, Repository works only on a chunk-level, and most chunks are very small. That just won't work. (As mentioned above)
Working on the LoggedIO level (i.e. alternate implementation of that, which doesn't store segments in the FS, but S3) sounds more promising to me (but - eventual consistency, so the Repository index must be both local and remote, i.e. remote updated after a successful local transaction, so we will actually need to re-implement both LoggedIO and Repository).
Locking: either external (e.g. a simple(!) database - are there ACID RESTful databases? those wouldn't need a lot of code or external deps) or "user promise locking" (i.e. 'Yes dear Borg, I won't run things in parallel').
Eventual consistency: put the last (id_hash(Manifest), timestamp) into the locking storage or locally, and refuse to operate if the Manifest on S3 isn't equal to it?
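A rough sketch of that last idea, with invented names: keep a locally cached (manifest id, timestamp) pair and refuse to operate if it doesn't match what is currently readable from the remote.

```python
# Sketch of a "refuse to operate if the remote manifest looks stale" guard.
# Names are invented; fetch_remote_manifest() would read the manifest object
# from S3, id_hash() is whatever id function the repo uses, and the local
# state file remembers the last manifest we wrote or saw.
import json
import os

STATE_FILE = "last-manifest.json"  # hypothetical local state

def check_manifest(fetch_remote_manifest, id_hash):
    remote_id = id_hash(fetch_remote_manifest())
    if not os.path.exists(STATE_FILE):
        return  # first use, nothing to compare against
    with open(STATE_FILE) as f:
        state = json.load(f)
    if state["manifest_id"] != remote_id:
        raise RuntimeError(
            "remote manifest does not match last known state "
            "(eventual consistency or a concurrent writer?) - refusing to operate"
        )

def remember_manifest(manifest_bytes, id_hash, timestamp):
    with open(STATE_FILE, "w") as f:
        json.dump({"manifest_id": id_hash(manifest_bytes),
                   "timestamp": timestamp}, f)
```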
For what it's worth, I'm currently using borg on top of a Hubic FUSE-based filesystem for my off-site backups. It's painfully slow - my net effective writing speed is only around 1 Mb/s - but other than that it works pretty well.
Issues as I see them:
- Writes have a very high latency. Once you're writing it's fast (10 Mb/s, intentionally limited within Hubic), but there seems to be a two second delay at the beginning of each file write.
- Reads are reasonably fast. There's certainly nothing like the write latency, but I've yet to turn this from an empirical impression into a quantified value.
- The process is slow, so avoiding inter-feature locking would be a very good thing (`borg list` and `borg extract`, specifically).
It might help to cache KV updates locally before periodically writing them out in one burst, but I don't have any easy way of testing this. (It would be nice if there were a generic FUSE caching layer, but I have not been able to find one.)
Increasing the segment size in the repo config might help if there is a long-ish ramp-up period for uploads (and increasing filesystem-level buffer sizes, if possible).
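If anyone wants to experiment with that, the repository config is a plain INI file. A hedged sketch of bumping the segment size with configparser (path and value are placeholders; check your borg version's docs for whether max_segment_size may safely be changed on an existing repo):

```python
# Hedged sketch: raise max_segment_size in an existing repo's config so fewer,
# larger segment files get written/uploaded. Path and value are placeholders;
# verify against your borg version's documentation before relying on this.
import configparser

repo_config = "/path/to/repo/config"  # placeholder

cfg = configparser.ConfigParser()
cfg.read(repo_config)
cfg["repository"]["max_segment_size"] = str(100 * 1024 * 1024)  # e.g. 100 MiB
with open(repo_config, "w") as f:
    cfg.write(f)
```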
http://rclone.org/ may be interesting as a component for the cloud support plan.
here's the most original solution I have heard yet for "cloud" backups with borg:
https://juliank.wordpress.com/2016/05/11/backing-up-with-borg-and-git-annex/
TL;DR: backup locally, then use git-annex (!) to backup to... well, anything. in this case, a webdav server, but yeah, git-annex supports pretty much anything (including rclone) and can watch over directories. I'm surprised this works at all!
Yeah, so I've already gone down the git-annex route through research and testing, and it's extremely complicated. The way you're suggesting is really dirty and tedious... git-annex is a whole other beast to learn. Really, users of Borg could already just rclone their backups to whatever cloud provider is supported by rclone (most of them). You'd only need to add git-annex if you're looking for even more versioning and/or encryption.
Well, that was the whole point, wasn't it: encryption... Then again, it's unclear to me why they didn't use the built-in encryption.
The whole setup, even with just rclone, also has the problem that you have a local repository which takes up local disk space. Obviously, this is not a complete solution for the problem here, but I thought it would be interesting to share nonetheless.
So.... if I `rclone` the entire borg repo to my favorite cloud storage provider, and then I later want to restore something, do I have to re-download the entire repo? And what if one or two "chunks" get corrupted, can I still recover the rest?
https://github.com/gilbertchen/duplicacy-beta
It looks like duplicacy has most or all of the same features as borg-backup, and it also supports backing up to cloud storage like Amazon S3. Unfortunately, it is not currently open-source, and the development seems to happen behind closed doors.
The developer does share a design document, so it is possible to get a general idea of how it works. If I understand it correctly, the reason duplicacy is able to work with cloud storage is that it does not have a specialized index or database to keep track of chunks, but rather uses the filesystem and names the files/chunks after their hash.
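That content-addressed layout is simple enough to sketch in a few lines (sha256 here just illustrates "name the chunk after its hash"; it's not necessarily duplicacy's or borg's actual scheme):

```python
# Minimal illustration of content-addressed chunk storage: the chunk's hash is
# its filename, so deduplication is just "does the file already exist?".
import hashlib
import os

def store_chunk(chunk, chunk_dir="chunks"):
    os.makedirs(chunk_dir, exist_ok=True)
    name = hashlib.sha256(chunk).hexdigest()
    path = os.path.join(chunk_dir, name)
    if not os.path.exists(path):      # identical chunk already stored?
        with open(path, "wb") as f:
            f.write(chunk)
    return name                       # callers reference the chunk by its hash
```

Because every operation is just a whole-object write or existence check, the same layout maps naturally onto an object store like S3.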
It's super-shady that they're using github while not being open source.