
Usage with cloud storage like Amazon S3 or Glacier

Open vote539 opened this issue 7 years ago • 13 comments

I would like to set up a BTRFS filesystem with backups, and was happy to find this project. However, I would like the backups sent to a cloud storage solution instead of a hard drive or an SSH server. Most of these cloud storage solutions expose RESTful APIs, and you have no control over the storage medium they use on their end.

Does btrbk support sending backups to an arbitrary REST interface?

vote539 · Jan 09 '17 09:01

Does btrbk support sending backups to an arbitrary REST interface?

No, this is neither implemented nor planned. If you want to push "target raw" backups to your Amazon S3 storage, you need to somehow mount it locally. You could use s3fs for this, which should do exactly that. So your setup could be something like this:

  1. mount amazon s3 using s3fs to /mnt/mys3drive
  2. configure target raw /mnt/mys3drive/btrbk_backups/... in btrbk.conf
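
For reference, a minimal btrbk.conf sketch for step 2 could look like the following (pool path, subvolume name and snapshot directory are placeholders, not taken from this thread):

volume /mnt/btr_pool
  subvolume home
    snapshot_dir btrbk_snapshots
    target raw /mnt/mys3drive/btrbk_backups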

If you get this working, please post a note here, so that I could add a section for this on the FAQ.

digint · Jan 09 '17 12:01

Thanks for the reply! Here's what ended up working for me. AWS block storage (EBS) is in the same price range per gigabyte as S3, so I created a block storage volume and formatted it as BTRFS. I attached the volume to a "nano" head node whose only job is to run btrbk. This setup gives me 500 GB of backup storage for about US$20/mo.

vote539 · Jan 17 '17 13:01

Amazon S3 is quite pricey if you are looking only for long-term archival. However, services such as Amazon Glacier don't seem to be easily mountable. It would be convenient if btrbk provided a target type for piping incremental backups into arbitrary commands. Think:

volume /mnt/btr_pool
  subvolume home
    target pipe /usr/bin/glacier archive upload my_vault --name={} -

where {} would expand to the name of the file being passed on stdin, and where the /usr/bin/glacier command comes from basak/glacier-cli. It seems trivial to just add

btrbk run && btrfs send -p `find snapshot_dir/ -mindepth 1 -maxdepth 1 | sort | tail -2` |
  (insert a compression and encryption pipeline) |
  glacier archive upload my_vault --name=`ls snapshot_dir | tail -1`.btrfs -

to one's crontab and be done with it, but then you also need to keep a journal of unsuccessful uploads (due to the machine being offline, for example), so that everything gets backed up eventually. This is not an insurmountable task, but direct support for this kind of usage in btrbk would definitely be welcome.
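
As a rough sketch of such a crontab wrapper (purely illustrative: paths, vault name, compression stage and journal location are assumptions, and pruning of local snapshots is ignored):

#!/bin/bash
# Upload every local btrbk snapshot not yet recorded in the journal, so
# snapshots missed while the machine was offline are retried next run.
set -o pipefail
shopt -s nullglob
SNAPDIR=/mnt/btr_pool/btrbk_snapshots
JOURNAL=/var/lib/btrbk/glacier.journal

btrbk run || exit 1
for snap in "$SNAPDIR"/*; do
  name=$(basename "$snap")
  grep -qxF "$name" "$JOURNAL" 2>/dev/null && continue   # already uploaded
  parent=$(tail -n1 "$JOURNAL" 2>/dev/null)              # last successful upload, if any
  args=()
  [ -n "$parent" ] && args=(-p "$SNAPDIR/$parent")       # send incrementally against it
  if btrfs send "${args[@]}" "$snap" | gzip |
       glacier archive upload my_vault --name="$name.btrfs.gz" -
  then
    echo "$name" >> "$JOURNAL"                           # record success
  fi
done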

Witiko · May 24 '17 18:05

This is a nice idea, but it's incomplete: as btrbk is stateless, it always needs to know which subvolumes are already present on the target side. For target send-receive, this information is fetched via btrfs subvolume list; for target raw, the UUIDs are encoded in the filenames.

In order to complete this, we would need to define some data structure: timestamp, UUID, received-UUID, parent-UUID (similar to btrfs subvolume list), along with a user-defined command that generates it. btrbk would then parse this data and figure out which subvolumes need to be sent to the target according to the configured target_preserve policy, and which parents to pick for incremental send.
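
For illustration only, the output of such a user-defined list command could look something like this (hypothetical format and values, nothing btrbk defines today):

20170531T1900  uuid=1c3d...  received_uuid=9f2a...  parent_uuid=-
20170601T1900  uuid=7b8e...  received_uuid=4d5c...  parent_uuid=1c3d...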

PS: sorry for the late reply, I'm really busy with other things at the moment...

digint · May 31 '17 19:05

My original idea was that btrbk would keep tabs on the successful invocations to automatically infer which volumes need sending. If /usr/bin/glacier archive upload my_vault --name={} - from my example returned a zero exit code, btrbk would append {} to a list. Note that the user could specify where they want this list stored:

volume /mnt/btr_pool
  subvolume home
    target pipe /usr/bin/glacier archive upload my_vault --name={} -
    journal /var/lib/btrbk/glacier

Deleted subvolumes could be removed from the list, so that it does not grow ad infinitum.

Witiko · May 31 '17 20:05

Yeah well, but then people start deleting files on the target by hand, and the mess with the journal starts...

I guess glacier also provides some sort of directory listing, so if btrbk generated filenames the same way as it does for target raw, we could always fetch and parse them the same way.

volume /mnt/btr_pool
  subvolume home
    target pipe /usr/bin/glacier archive upload my_vault --name={} -
      list_cmd /usr/bin/glacier <insert list command here> my_vault

digint · May 31 '17 20:05

That would be /usr/bin/glacier archive list my_vault in this case. However, my idea was that the pipe target would be a fire-and-forget kind of thing. If the user wants to start deleting data from the target, that is not our problem. Suppose I am just piping the data to a mail transfer agent over SMTP, or to a remote shell; I may well not be able to report on what is stored on "the other side". I find this concept more flexible than what you propose.

P.S.: I guess target pipe is a little confusing name, as it implies that the target is a named pipe. Both target command and target pipeline resolve this ambiguity.

Witiko · May 31 '17 21:05

However, my idea was that the pipe target would be a fire-and-forget kind of a thing

Yes, I understand, and I see the benefit in this, but that's not how btrbk works. Maybe we could introduce a new sub-command for this kind of thing, something like btrbk oneshot, which would simply create a new snapshot and transfer it (always non-incremental) to the target. The main problem here would be to keep the config consistent and non-confusing. Maybe something like this:

volume /mnt/btr_pool
  subvolume home
    target pipe /usr/bin/glacier archive upload my_vault --name={} -
      target_type oneshot

digint · Jun 01 '17 10:06

and transfer it (always non-incremental) to the target.

Note that keeping a journal would make it possible to transfer incremental backups even in this setting.

Witiko · Jun 01 '17 11:06

s3fs

I've been trying to get this to work. There are a number of issues.

  • fuse is an operational burden, and docker doesn't help.
    • fuse in a docker container requires --cap-add SYS_ADMIN --device /dev/fuse, even if the mount is not exposed outside the container (see the sketch after this list): https://github.com/docker/for-linux/issues/321
    • exposing fuse across containers requires special host config (mount --make-shared)
    • if a fuse app shuts down uncleanly, then its mountpoint becomes broken and requires a manual umount before it can be used again. Docker does not clean this up automatically.
    • It's not clear fuse issues will be resolved, because it's an inherent design mismatch. Requiring admin access and special configuration to do network storage is a non-starter.
  • s3fs is not production-quality
    • After weeks of testing, I haven't been able to use it to upload large files.
      • The latest release is broken:
        • https://github.com/s3fs-fuse/s3fs-fuse/issues/1941
        • https://github.com/s3fs-fuse/s3fs-fuse/issues/1936
      • Older releases don't support -o enable_content_md5, which is required for Backblaze B2, and possibly other providers
    • s3fs' cache options do not play well with btrbk
      • s3fs has a metadata cache, but cat /s3/file; cat /s3/file will still issue two HeadObject requests. This is bad with btrbk as it reads all the *.info files on a raw target on every run.
      • s3fs will cache huge amounts of data to disk during file uploads, rather than streaming them
    • It's not clear s3fs issues will be resolved. Its codebase is undocumented, has heavy copy-paste duplication, uses non-meaningful naming schemes, and interlaces high-level business logic with utility functions. A large portion of it is dedicated to complex ad-hoc manipulations of a userspace cache. The design of this cache is questionable, and I certainly can't get it to perform well. Its user documentation is incoherent.
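
For reference, the fuse-in-docker requirements from the first bullet translate into an invocation roughly like the following (image name, bucket and mountpoints are placeholders, not taken from any s3fs or docker documentation):

# on the host, once: the mountpoint must itself be a shared mount
mkdir -p /mnt/s3
mount --bind /mnt/s3 /mnt/s3 && mount --make-shared /mnt/s3

# run s3fs in a container; SYS_ADMIN and /dev/fuse are needed for fuse,
# and the :shared volume flag propagates the mount back to the host
docker run --rm --cap-add SYS_ADMIN --device /dev/fuse \
  -v /mnt/s3:/mnt/s3:shared \
  my-s3fs-image s3fs mybucket /mnt/s3 -f -o passwd_file=/etc/passwd-s3fs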

It would be a huge win if btrbk could use S3 APIs directly. Dozens of cloud providers expose an S3 API now.

The S3 API is a large surface though. Minimal S3 support probably still requires multiple signature versions and autodetection of multipart uploads, and likely other stuff.

In the meantime, I suggest btrbk.conf should offer a set of command endpoints, something like:

target pipe
  pipe_target_list_files /usr/local/bin/list_files_from_s3.sh my_bucket
  pipe_target_read_file /usr/local/bin/read_file_from_s3.sh my_bucket
  pipe_target_write_file /usr/local/bin/write_file_to_s3.sh my_bucket

The expected interactions would then be just like target raw: the scripts would be used to read and write the *.info files in the same patterns as today.
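
For illustration, such endpoint scripts could be thin wrappers around the AWS CLI. This is only a sketch of the proposed interface; the argument convention (bucket as configured, object name appended by btrbk) is an assumption:

#!/bin/sh
# list_files_from_s3.sh (sketch): print one object name per line
# (assumes a flat bucket and keys without spaces)
aws s3 ls "s3://$1/" | awk 'NF >= 4 {print $4}'

#!/bin/sh
# read_file_from_s3.sh (sketch): stream s3://<bucket>/<name> to stdout
exec aws s3 cp "s3://$1/$2" -

#!/bin/sh
# write_file_to_s3.sh (sketch): stream stdin to s3://<bucket>/<name>
exec aws s3 cp - "s3://$1/$2"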

sbrudenell · May 01 '22 19:05

Looking for a similar solution: I just want to push an encrypted archive of a snapshot into S3-compatible long-term storage such as https://www.ovhcloud.com/en-ca/public-cloud/cold-archive/. At $2/month/TB it's worth it! I guess I can do it some other way, but having it directly integrated with btrbk is a must.

lpyparmentier · Jun 06 '23 15:06

Hm… instead of implementing the whole S3 API ourselves or jumping the gun with custom scripts, how about adding rclone support for uploading and managing files? It seems like rclone has all the necessary commands, e.g.:

  • rclone rcat -- can be used to pipe directly into storage.
  • rclone lsf -- can be used to list current archives in storage.
  • rclone cat -- can be used to pipe directly out of storage.

The only downside is that rclone has its own config format, which might make this messier than just allowing custom scripts.
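
For example, a manual round trip along these lines could look roughly like this (remote name, bucket, snapshot path and the zstd compression step are assumptions):

# upload a snapshot as a compressed send stream
btrfs send /mnt/btr_pool/btrbk_snapshots/home.20240306 | zstd |
  rclone rcat remote:mybucket/home.20240306.btrfs.zst

# list what is already archived
rclone lsf remote:mybucket

# restore
rclone cat remote:mybucket/home.20240306.btrfs.zst | zstd -d |
  btrfs receive /mnt/btr_pool/restore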

bojidar-bg · Mar 06 '24 12:03

Shameless plug: my simple solution to this problem, https://github.com/kubrickfr/btrfs-send-to-s3

kubrickfr · Mar 28 '24 16:03