
Support Additional Backends?

Open · phillipCouto opened this issue 9 years ago · 24 comments

Would anyone be interested in having rdedup natively support multiple backends?

Examples of backends would be:

  • Local Filesystem (current)
  • Amazon S3
  • Openstack Swift
  • SFTP
  • Webdav
  • Google Drive

Just to name a few. Thoughts?

phillipCouto avatar Nov 30 '16 13:11 phillipCouto

It would be great!

legrostdg avatar Nov 30 '16 16:11 legrostdg

I personally use http://rclone.org/ to sync my backup to the cloud. Just write a script that first does the backup, then syncs it. rclone supports most, if not all, of the backends you mention @phillipCouto, and I can recommend it. I don't see any benefit of merging that functionality into rdedup.

dpc avatar Nov 30 '16 18:11 dpc

The benefit would be that you don't need a local copy of your backup: rdedup could talk directly to the backend, so as data is written to or read from the repo, requests go straight to the backend.

phillipCouto avatar Nov 30 '16 18:11 phillipCouto

I see. That is true. In that case I'd first look for FUSE-level software that exposes the backends, e.g. doing an rdedup backup to a --dir that is really an sshfs mount. That would be another way to get the same thing done without adding it to rdedup.

I mean: this feature could be done, but it's going to be a huge amount of work. I expect adding support for each of the backends would be an amount of work similar to rdedup itself. Though if done properly, and in a modular way, the Rust community could reuse such code, so it does have its own merits.

rdedup-lib could provide an abstraction over accessing the storage (it kind of is already there, though for internal logic like garbage collection), defaulting to std::fs operations. This would allow backends to be swapped in.
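A minimal sketch of what such a swappable storage abstraction could look like, with the default backed by std::fs (the trait and type names here are illustrative assumptions, not the actual rdedup-lib API):

```rust
use std::fs;
use std::io;
use std::path::PathBuf;

/// Hypothetical storage abstraction; a real interface would also need
/// listing, removal, etc. (discussed further down in this thread).
pub trait Backend {
    fn write(&self, name: &str, data: &[u8]) -> io::Result<()>;
    fn read(&self, name: &str) -> io::Result<Vec<u8>>;
}

/// Default backend: plain files under a root directory, using std::fs.
pub struct LocalFs {
    pub root: PathBuf,
}

impl Backend for LocalFs {
    fn write(&self, name: &str, data: &[u8]) -> io::Result<()> {
        fs::write(self.root.join(name), data)
    }

    fn read(&self, name: &str) -> io::Result<Vec<u8>> {
        fs::read(self.root.join(name))
    }
}
```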

dpc avatar Nov 30 '16 19:11 dpc

Yeah, I was thinking of just abstracting it out and then only implementing the fs layer in rdedup-lib. Any other backend would be a community contribution that others can implement.

Maybe I need to reword this issue to make it clearer: the intention was to create an interface that abstracts the backend logic from the core rdedup logic, like what you did for GC.

phillipCouto avatar Nov 30 '16 19:11 phillipCouto

This is a great idea! My vote is for an abstraction layer + Google Drive. When creating the abstraction layer, could it support more than one destination, e.g. local filesystem + Google Drive? That would allow for a local cache for immediate restores, with a network copy in case of a local disaster.

jpap avatar Nov 30 '16 20:11 jpap

The question is at what level the abstraction should be: purely file-system, or more logical (chunks?).

When doing this abstraction, make it composable. Just like in slog-rs, where the Drain trait is an abstraction that has a Duplicate implementation, which takes two other Drains, combines them, and writes everything to both. This way it's easy to write logic that does mirroring, failover, striping chunks between multiple providers, and so on.
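For illustration, a composable wrapper in that spirit might look roughly like this (a sketch only; `Backend` is the hypothetical trait from the earlier comment, and `Mirror` is an invented name playing the role of slog's Duplicate):

```rust
use std::io;

// Minimal stand-in for the backend trait sketched earlier.
pub trait Backend {
    fn write(&self, name: &str, data: &[u8]) -> io::Result<()>;
    fn read(&self, name: &str) -> io::Result<Vec<u8>>;
}

/// Writes every chunk to both wrapped backends; reads from the first
/// and falls back to the second (mirroring plus simple failover).
pub struct Mirror<A, B> {
    pub primary: A,
    pub secondary: B,
}

impl<A: Backend, B: Backend> Backend for Mirror<A, B> {
    fn write(&self, name: &str, data: &[u8]) -> io::Result<()> {
        self.primary.write(name, data)?;
        self.secondary.write(name, data)
    }

    fn read(&self, name: &str) -> io::Result<Vec<u8>> {
        self.primary
            .read(name)
            .or_else(|_| self.secondary.read(name))
    }
}
```

Striping or failover-only variants would just be other implementations of the same trait.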

dpc avatar Nov 30 '16 20:11 dpc

Does it make sense to start with the simplest form first, which is just a single destination, and then improve the abstraction past that? So right now, abstract the file system away so it can be substituted with something else. Then, once that has been tried and tested, we can look at composing backends.

phillipCouto avatar Nov 30 '16 20:11 phillipCouto

Also, I think rdedup really only uses a handful of operations from the file system anyway:

  • List
  • Read
  • Rename
  • Remove
  • Write
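For illustration, those five operations could map onto a backend trait roughly like this (hypothetical signatures, not the actual rdedup interface; a local-filesystem implementation would just forward each call to std::fs):

```rust
use std::io;

/// Hypothetical trait covering the handful of operations listed above.
pub trait Backend {
    fn list(&self) -> io::Result<Vec<String>>;
    fn read(&self, name: &str) -> io::Result<Vec<u8>>;
    fn rename(&self, from: &str, to: &str) -> io::Result<()>;
    fn remove(&self, name: &str) -> io::Result<()>;
    fn write(&self, name: &str, data: &[u8]) -> io::Result<()>;
}
```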

phillipCouto avatar Nov 30 '16 20:11 phillipCouto

Seems to me it should be a chunk-level abstraction. Some backends might not even have a clear file-system layer. Some will be a combination of local caching, remote accesses, and so on. I think spending some time on designing the abstraction right is worth it, even if we might need to introduce some fixes later.

It's perfectly fine to start with the abstraction + one implementation.

dpc avatar Nov 30 '16 20:11 dpc

Makes sense. Looks like the next thing to work on is the spec for the abstraction =)

phillipCouto avatar Nov 30 '16 20:11 phillipCouto

Yeah, so the abstraction will basically look something like that, though:

Rename is a file system concept. The backend should deal with how it stores the chunks and how it names them.

GC is rather important; let's spend some time thinking about how it would work in a setup where chunks are spread over remote storage.

dpc avatar Nov 30 '16 20:11 dpc

Hmmm... I guess rdedup will just fetch the index chunks, read them, compare against the list of files returned by each backend, and issue delete commands. OK. Looks OK to me.
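A rough sketch of that flow (the `reachable` set stands in for whatever rdedup would compute by reading the index chunks; the trait is the same hypothetical one sketched above):

```rust
use std::collections::HashSet;
use std::io;

// Minimal stand-in for the backend operations this sketch needs.
pub trait Backend {
    fn list(&self) -> io::Result<Vec<String>>;
    fn remove(&self, name: &str) -> io::Result<()>;
}

/// Delete every stored chunk that no index references, returning how
/// many chunks were removed.
fn gc(backend: &dyn Backend, reachable: &HashSet<String>) -> io::Result<u64> {
    let mut removed = 0;
    for name in backend.list()? {
        if !reachable.contains(&name) {
            backend.remove(&name)?;
            removed += 1;
        }
    }
    Ok(removed)
}
```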

dpc avatar Nov 30 '16 20:11 dpc

The other issue we need to figure out is locking.

phillipCouto avatar Nov 30 '16 20:11 phillipCouto

The repo does lock itself as a whole with an RWLock. If backends need to, they should do their own synchronization. Do you mean scenarios in which a backend is shared between multiple rdedup repos?

dpc avatar Nov 30 '16 20:11 dpc

No, I mean a backup to the same repo from multiple rdedup instances.

phillipCouto avatar Nov 30 '16 20:11 phillipCouto

Oh no. It's getting hairy. :D The way rdedup is currently designed is to support file-level synchronization using Syncthing/Dropbox.

The only place where locking is really a problem is GC, as it could schedule deletion of a file that a concurrent backup assumes is still there. Otherwise, atomic operations solve the problems of accessing not-fully-written chunks, crashes, and so on.

I guess backends could be RWLock-protected: the trait would have LockRead, LockWrite, Unlock primitives or something like that, which would lock the whole backend store for the duration of the operation. Underneath, the backend could write a temporary lock file to the storage or use other primitives it supports.
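A hedged sketch of what those locking primitives might look like as a trait (method names are illustrative; an implementation could write a temporary lock file to the store or use whatever native locking the provider offers):

```rust
use std::io;

/// Hypothetical locking extension for backends.
pub trait BackendLock {
    /// Shared lock, taken by backups and restores.
    fn lock_read(&self) -> io::Result<()>;
    /// Exclusive lock, taken by GC.
    fn lock_write(&self) -> io::Result<()>;
    /// Release whatever lock this instance currently holds.
    fn unlock(&self) -> io::Result<()>;
}
```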

dpc avatar Nov 30 '16 21:11 dpc

That makes sense. It is also simple enough. Don't mean to play devil's advocate, but what happens if the locking rdedup instance fails and doesn't clean up correctly? =P

phillipCouto avatar Nov 30 '16 22:11 phillipCouto

I was thinking about it. If the backend can, it should do it in a way that will just work. If it can't, some kind of timeout should be used. E.g. the lock file could have a timestamp that expires after an hour, and the backend should keep updating it every 5 minutes or something like that.
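For illustration, the expiring-lock idea could be sketched like this (the one-hour expiry and five-minute refresh are just the numbers from the comment above; the lock-file name and format are assumptions). In practice this would run on a background thread until the operation finishes:

```rust
use std::io;
use std::thread;
use std::time::{Duration, SystemTime, UNIX_EPOCH};

// Minimal stand-in for a backend that can write a lock file.
pub trait Backend {
    fn write(&self, name: &str, data: &[u8]) -> io::Result<()>;
}

/// Keep a lease alive: rewrite the lock file with a fresh timestamp
/// every 5 minutes. Other instances would treat a timestamp older than
/// an hour as a stale lock left behind by a crashed process.
fn hold_lock(backend: &dyn Backend) -> io::Result<()> {
    loop {
        let now = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .expect("system clock before 1970")
            .as_secs();
        backend.write("lock", now.to_string().as_bytes())?;
        thread::sleep(Duration::from_secs(5 * 60));
    }
}
```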

dpc avatar Nov 30 '16 22:11 dpc

Just FYI, rclone has an experimental command "mount" which can mount remote backends as a FUSE directory. http://rclone.org/commands/rclone_mount/

kcwu avatar Jan 11 '17 17:01 kcwu

It is going to happen. See #99 and #100.

dpc avatar May 08 '17 06:05 dpc

An alternative could be the one that is currently being discussed in restic: https://github.com/restic/restic/issues/1561

rdedup would talk to rclone directly, without having a local copy. It would require modifications to rclone, but it may be possible to use the same ones needed for interacting with restic.

It may be a good time to jump into the discussion :-).

legrostdg avatar Feb 05 '18 10:02 legrostdg

OK, so we can leverage https://rclone.org/commands/rclone_serve_restic/

geek-merlin avatar Jan 06 '21 23:01 geek-merlin

Here are instructions on how to use rclone as a library.

(Oh, and a title update may help; I opened #187 when I did not find this at first.)

geek-merlin avatar Jan 12 '21 23:01 geek-merlin