Support Additional Backends?
Would anyone be interested in having rdedup natively support multiple backends?
Examples of backends would be:
- Local Filesystem (current)
- Amazon S3
- OpenStack Swift
- SFTP
- WebDAV
- Google Drive
Just to name a few. Thoughts?
It would be great!
I personally use http://rclone.org/ to sync my backup to the cloud. Just write a script that first does the backup, then syncs it. rclone supports most, if not all, of the backends you mention @phillipCouto , and I can recommend it. I don't see any benefit of merging that functionality into rdedup.
The benefit would be that you don't need a local copy of your backup. rdedup could talk directly to the backend, so as data is written to or read from the repo, requests are made directly against the backend.
I see. That is true. In that case I'd first look for fuse-level software that exposes the backends. E.g. doing an rdedup backup to a --dir that is really an sshfs mount. That would be another way to get the same thing done, without adding it to rdedup.
I mean: this feature could be done, but it's going to be a huge amount of work. I expect adding support for each of the backends would be an amount of work similar to rdedup itself. Though if done properly, and in a modular way, the Rust community could reuse such code, so it does have its own merits.
rdedup-lib could provide an abstraction over accessing the storage (it kind of is already there, though for internal logic like garbage collection), defaulting to std::fs operations. This would allow swappable backends.
Yeah I was thinking of just abstracting it out and then only implementing the fs layer in the rdedup-lib. Any other backend would be a community contribution that others can implement.
Maybe I need to reword this issue to make it more clear as the intention was to create an interface to abstract the backend logic from the core rdedup logic like what you did for GC.
This is a great idea! My vote is for an abstraction layer + Google Drive. When creating the abstraction layer, could it support more than one destination? e.g. local filesystem + Google Drive. That would allow for a local cache in case of any immediate restores; with a network copy in case of a local disaster.
The question is at what level should the abstraction be: purely file-system, or more logical (chunks?).
When doing this abstraction, make it composable. Just like in slog-rs: the Drain trait is an abstraction that has a Duplicate implementation, which takes two other Drains, combines them, and writes everything to both. This way it's easy to write logic that does mirroring, failover, striping chunks between multiple providers, and so on.
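The slog-style composition described above could look roughly like this in Rust. This is a hypothetical sketch (the `Store`, `Mem`, and `Duplicate` names are illustrative, not rdedup's API): a `Duplicate` wrapper mirrors every write to two inner backends.

```rust
// Hypothetical composable-backend sketch, modeled on slog-rs's Drain/Duplicate.
trait Store {
    fn write(&mut self, name: &str, data: &[u8]);
}

// Trivial in-memory backend, standing in for a real one.
struct Mem(std::collections::HashMap<String, Vec<u8>>);

impl Store for Mem {
    fn write(&mut self, name: &str, data: &[u8]) {
        self.0.insert(name.to_string(), data.to_vec());
    }
}

// Combinator: writes every chunk to both inner stores (mirroring).
struct Duplicate<A, B>(A, B);

impl<A: Store, B: Store> Store for Duplicate<A, B> {
    fn write(&mut self, name: &str, data: &[u8]) {
        self.0.write(name, data);
        self.1.write(name, data);
    }
}

fn main() {
    let mut both = Duplicate(Mem(Default::default()), Mem(Default::default()));
    both.write("chunk", b"payload");
    // The chunk ends up in both backends.
    assert_eq!((both.0).0["chunk"], b"payload".to_vec());
    assert_eq!((both.1).0["chunk"], b"payload".to_vec());
    println!("mirrored");
}
```

The same shape generalizes to failover (try the second store only when the first errors) or striping (route chunks by hash), which is the appeal of keeping the trait small.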
Does it make sense to start with the simplest form first, which is just a single destination, and then improve the abstraction past that? So right now, abstract the file system away so it can be substituted with something else. Then, once that has been tried and tested, we can look at composing backends.
Also, I think rdedup really uses only a handful of operations from the file system anyway:
- List
- Read
- Rename
- Remove
- Write
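The handful of operations above could be captured in a small trait. A minimal sketch, assuming hypothetical names (`Backend`, `MemBackend`) that are not rdedup's actual API, with an in-memory implementation standing in for the default std::fs one:

```rust
use std::collections::HashMap;
use std::io;

// Hypothetical backend trait exposing only the five operations listed above.
trait Backend {
    fn list(&self) -> io::Result<Vec<String>>;
    fn read(&self, name: &str) -> io::Result<Vec<u8>>;
    fn write(&mut self, name: &str, data: &[u8]) -> io::Result<()>;
    fn rename(&mut self, from: &str, to: &str) -> io::Result<()>;
    fn remove(&mut self, name: &str) -> io::Result<()>;
}

// In-memory stand-in for the std::fs-based default.
struct MemBackend {
    files: HashMap<String, Vec<u8>>,
}

impl Backend for MemBackend {
    fn list(&self) -> io::Result<Vec<String>> {
        Ok(self.files.keys().cloned().collect())
    }
    fn read(&self, name: &str) -> io::Result<Vec<u8>> {
        self.files.get(name).cloned().ok_or_else(|| {
            io::Error::new(io::ErrorKind::NotFound, name.to_string())
        })
    }
    fn write(&mut self, name: &str, data: &[u8]) -> io::Result<()> {
        self.files.insert(name.to_string(), data.to_vec());
        Ok(())
    }
    fn rename(&mut self, from: &str, to: &str) -> io::Result<()> {
        let data = self.files.remove(from).ok_or_else(|| {
            io::Error::new(io::ErrorKind::NotFound, from.to_string())
        })?;
        self.files.insert(to.to_string(), data);
        Ok(())
    }
    fn remove(&mut self, name: &str) -> io::Result<()> {
        self.files.remove(name).map(|_| ()).ok_or_else(|| {
            io::Error::new(io::ErrorKind::NotFound, name.to_string())
        })
    }
}

fn main() {
    let mut b = MemBackend { files: HashMap::new() };
    // Write-then-rename mirrors the atomic "write temp, rename into place" pattern.
    b.write("chunk.tmp", b"data").unwrap();
    b.rename("chunk.tmp", "chunk").unwrap();
    assert_eq!(b.read("chunk").unwrap(), b"data");
    b.remove("chunk").unwrap();
    assert!(b.list().unwrap().is_empty());
    println!("ok");
}
```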
Seems to me it should be chunk-level abstraction. Some backends might not even have a clear file-system layer. Some will be a combination of local caching, and remote accesses and so on. I think spending some time on designing the abstraction right is worth it, even if later we might need to introduce some fixes.
It's perfectly fine to start with the abstraction + one implementation.
Makes sense. Looks like the next thing is to work on the spec for the abstraction =)
Yeah, so the abstraction will basically look something like that, though:
Rename is a file-system concept. The backend should deal with how it stores the chunks and how it names them.
GC is rather important; let's spend some time thinking about how it would work in a setup where chunks are spread over remote storage.
Hmmm... I guess rdedup will just fetch the index chunks, read them, compare them with the list of files returned by each backend, and issue delete commands. OK. Looks OK to me.
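The GC idea above (live set from the index vs. the backend's listing) reduces to a set difference. A toy sketch, with the live set and the backend listing passed in as plain collections (the `gc` function and its inputs are hypothetical stand-ins, not rdedup code):

```rust
use std::collections::HashSet;

// Anything the backend stores that the index no longer references
// is an orphan and gets a delete command issued for it.
fn gc(live: &HashSet<String>, stored: &[String]) -> Vec<String> {
    stored
        .iter()
        .filter(|name| !live.contains(*name))
        .cloned()
        .collect()
}

fn main() {
    let live: HashSet<String> =
        ["a", "b"].iter().map(|s| s.to_string()).collect();
    let stored = vec![
        "a".to_string(),
        "b".to_string(),
        "orphan".to_string(),
    ];
    // Only the unreferenced chunk is scheduled for deletion.
    assert_eq!(gc(&live, &stored), vec!["orphan".to_string()]);
    println!("would delete: {:?}", gc(&live, &stored));
}
```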
The other issue we need to figure out is locking.
The repo does lock itself as a whole with an RWLock. If backends need to, they should do their own synchronization. Do you mean scenarios in which a backend is shared between multiple rdedup repos?
No, I mean backups to the same repo from multiple rdedup instances.
Oh no. It's getting hairy. :D The way rdedup is currently designed is to support file-level synchronization using Syncthing/Dropbox.
The only place where locking is really a problem is GC, as one could schedule deleting a file that a concurrent backup assumed is still there. Otherwise, atomic operations solve the problem of accessing not-fully-written chunks, crashes, and so on.
I guess backends could be RWLock-protected: the trait would have LockRead, LockWrite, and Unlock primitives, or something like that, that would lock the whole backend store for the duration of the operation. Underneath, the backend could write a temporary lock file to the storage or use other primitives it supports.
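The LockRead/LockWrite/Unlock primitives above could be sketched as a trait like this. Everything here is hypothetical, the trait name, the method names, and the trivial in-process implementation that just counts readers; a real backend would back these with a lock file on the remote storage or a native primitive:

```rust
// Hypothetical backend-level RW-lock trait; the backend chooses the mechanism.
trait BackendLock {
    fn lock_read(&mut self);
    fn lock_write(&mut self);
    fn unlock(&mut self);
}

// In-process reader/writer bookkeeping, just to show the shape.
struct LocalLock {
    readers: u32,
    writer: bool,
}

impl BackendLock for LocalLock {
    fn lock_read(&mut self) {
        assert!(!self.writer, "writer holds the lock");
        self.readers += 1;
    }
    fn lock_write(&mut self) {
        assert!(self.readers == 0 && !self.writer, "lock already held");
        self.writer = true;
    }
    fn unlock(&mut self) {
        if self.writer {
            self.writer = false;
        } else {
            self.readers -= 1;
        }
    }
}

fn main() {
    let mut l = LocalLock { readers: 0, writer: false };
    // Multiple concurrent readers (e.g. backups) are fine...
    l.lock_read();
    l.lock_read();
    l.unlock();
    l.unlock();
    // ...but GC takes the exclusive write lock.
    l.lock_write();
    l.unlock();
    println!("lock sequence ok");
}
```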
That makes sense. It is also simple enough. I don't mean to play devil's advocate, but what happens in the event the locking rdedup instance fails and doesn't clean up correctly? =P
I was thinking about it. If the backend can, it should do it in a way that will just work. If it can't, some kind of timeout should be used. E.g. the lock file could have a timestamp that expires after an hour, and the backend should keep updating it every 5 minutes or something like that.
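The expiring-lock idea above can be sketched in a few lines. The one-hour TTL and five-minute refresh come from the comment; the `Lock` struct and its methods are hypothetical:

```rust
use std::time::{Duration, Instant};

// A lock stamped with a refresh time; it goes stale after `ttl`
// unless the holder keeps refreshing it.
struct Lock {
    refreshed_at: Instant,
    ttl: Duration,
}

impl Lock {
    // A live holder calls this periodically (e.g. every 5 minutes).
    fn refresh(&mut self) {
        self.refreshed_at = Instant::now();
    }
    // Other instances treat an unrefreshed lock as abandoned.
    fn is_stale(&self, now: Instant) -> bool {
        now.duration_since(self.refreshed_at) > self.ttl
    }
}

fn main() {
    let now = Instant::now();
    let mut lock = Lock {
        refreshed_at: now,
        ttl: Duration::from_secs(3600),
    };
    // A fresh lock is honored...
    assert!(!lock.is_stale(now + Duration::from_secs(300)));
    // ...but if the holder crashed and stopped refreshing, it expires.
    assert!(lock.is_stale(now + Duration::from_secs(3601)));
    lock.refresh();
    println!("expiry ok");
}
```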
Just FYI, rclone has an experimental command, "mount", which can mount remote backends as a fuse directory. http://rclone.org/commands/rclone_mount/
It is going to happen. See #99 and #100.
An alternative could be the one that is currently being discussed in restic: https://github.com/restic/restic/issues/1561
rdedup would talk to rclone directly, without having a local copy. It would require modifications to rclone, but it may be possible to use the same ones needed for interacting with restic.
It may be a good time to jump into the discussion :-).
OK so we can leverage https://rclone.org/commands/rclone_serve_restic/
Here are instructions on how to use rclone as a library.
(Oh, and a title update may help; I opened #187 when I did not find this at first.)