
What is the concurrent/remote story?

Open jgoerzen opened this issue 5 years ago • 6 comments

Hi,

First, thanks for rdedup! This looks like a fantastic project to meet some real needs.

I would be happy to write some documentation on this if you can point me in the right direction.

I have two related questions.

First, is concurrent access to the repository allowed, and if so, in what ways? Can two processes write to it at once? Is any locking done? This is relevant for consolidating backups from multiple hosts to a single backup host.

Secondly, I see that cloud storage is WIP, which is fine. I'm wondering what exactly rdedup needs from its underlying filesystem, with an aim to evaluating whether it can run atop the various FUSE remotes; anything from sshfs to the S3-based ones, etc.

Thanks!

- John

jgoerzen avatar Jan 03 '21 05:01 jgoerzen

Hi @jgoerzen, not exactly answering your question but since you're interested in remote storages, see my recent PR: https://github.com/dpc/rdedup/pull/184

BTW, after you check the linked code, you will see that rdedup has solved concurrency quite well: the backend has two-level locking.

jendakol avatar Jan 03 '21 19:01 jendakol

Quick relevant link: https://github.com/dpc/rdedup/wiki/Rust's-fearless-concurrency-in-rdedup

IIRC, the whole backend storage is protected by a sort of read-write lock, and most operations (in particular adding new data) take a shared lock (https://github.com/dpc/rdedup/blob/8f9c76772e46c23e67d0189552c525dd2814b9c3/lib/src/lib.rs#L822 ). This works thanks to everything being idempotent and content-addressable. You can read & write at will, and any overwritten file etc. is going to have exactly the same content every time, so there's no problem.

Notably, removing data (mostly garbage collection) takes an exclusive lock.
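
To illustrate the locking scheme described above, here is a minimal in-memory sketch. It is not rdedup's actual code: the `Repo` type, its fields, and the method names are hypothetical, and `DefaultHasher` stands in for a real cryptographic digest. It only shows why content-addressed writes can share a lock while removal needs an exclusive one.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::{Mutex, RwLock};

/// Hypothetical stand-in for a repository; not rdedup's real types.
pub struct Repo {
    /// Repository-wide lock: writers share it, GC takes it exclusively.
    lock: RwLock<()>,
    /// Chunks keyed by a digest of their content (toy 64-bit hash here).
    chunks: Mutex<HashMap<u64, Vec<u8>>>,
}

impl Repo {
    pub fn new() -> Self {
        Repo {
            lock: RwLock::new(()),
            chunks: Mutex::new(HashMap::new()),
        }
    }

    /// Store a chunk under the hash of its content, holding the shared lock.
    /// Writing the same data twice stores the same bytes under the same key,
    /// so concurrent writers cannot conflict with each other.
    pub fn write_chunk(&self, data: &[u8]) -> u64 {
        let _shared = self.lock.read().unwrap();
        let mut hasher = DefaultHasher::new();
        data.hash(&mut hasher);
        let key = hasher.finish();
        self.chunks.lock().unwrap().insert(key, data.to_vec());
        key
    }

    /// Remove every chunk whose key is not in `live`, holding the exclusive
    /// lock so that no writer runs at the same time. Returns the number of
    /// chunks removed.
    pub fn gc(&self, live: &[u64]) -> usize {
        let _exclusive = self.lock.write().unwrap();
        let mut chunks = self.chunks.lock().unwrap();
        let before = chunks.len();
        chunks.retain(|key, _| live.contains(key));
        before - chunks.len()
    }

    /// Number of chunks currently stored.
    pub fn len(&self) -> usize {
        self.chunks.lock().unwrap().len()
    }
}
```

In this sketch any number of threads can call `write_chunk` at once, while a call to `gc` waits for them to finish and blocks new writers until it returns.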

dpc avatar Jan 04 '21 01:01 dpc

A backend is anything that can implement these basic interfaces https://github.com/dpc/rdedup/blob/8f9c76772e46c23e67d0189552c525dd2814b9c3/lib/src/aio/backend.rs#L32
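
As a rough idea of what such an interface could look like, here is a hypothetical sketch. The trait below is an assumption for illustration, not the trait at the linked line; the method set and the `MemBackend` type are invented, so check the linked source for the real definitions.

```rust
use std::collections::BTreeMap;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical minimal storage interface (NOT rdedup's real trait).
pub trait Backend {
    fn write(&mut self, path: &Path, data: &[u8]) -> io::Result<()>;
    fn read(&self, path: &Path) -> io::Result<Vec<u8>>;
    fn list(&self, dir: &Path) -> io::Result<Vec<PathBuf>>;
    fn rename(&mut self, from: &Path, to: &Path) -> io::Result<()>;
    fn remove(&mut self, path: &Path) -> io::Result<()>;
}

/// Toy in-memory backend: a sorted map from path to file content.
#[derive(Default)]
pub struct MemBackend {
    files: BTreeMap<PathBuf, Vec<u8>>,
}

fn not_found() -> io::Error {
    io::Error::from(io::ErrorKind::NotFound)
}

impl Backend for MemBackend {
    fn write(&mut self, path: &Path, data: &[u8]) -> io::Result<()> {
        // Overwriting is allowed; with content-addressed names the bytes
        // written under a given path are expected to be identical anyway.
        self.files.insert(path.to_path_buf(), data.to_vec());
        Ok(())
    }

    fn read(&self, path: &Path) -> io::Result<Vec<u8>> {
        self.files.get(path).cloned().ok_or_else(not_found)
    }

    fn list(&self, dir: &Path) -> io::Result<Vec<PathBuf>> {
        // Every stored path located under `dir`, in sorted order.
        Ok(self
            .files
            .keys()
            .filter(|p| p.starts_with(dir))
            .cloned()
            .collect())
    }

    fn rename(&mut self, from: &Path, to: &Path) -> io::Result<()> {
        let data = self.files.remove(from).ok_or_else(not_found)?;
        self.files.insert(to.to_path_buf(), data);
        Ok(())
    }

    fn remove(&mut self, path: &Path) -> io::Result<()> {
        self.files.remove(path).map(|_| ()).ok_or_else(not_found)
    }
}
```

A FUSE mount, an S3-like store, or a local directory would each need to provide operations of roughly this kind; how cheap `rename` is on a given remote is one of the things worth checking.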

dpc avatar Jan 04 '21 01:01 dpc

Notably, removing data (mostly garbage collection) takes an exclusive lock.

@dpc Thanks for pointing that out, I already wondered about that. So starting a GC blocks everything else until done? Or is the granularity finer? Is the case where one backup writes a chunk and another one GCs it covered? (A source pointer would be appreciated ;-) ) What if the repo is remote (IIRC on S3-like remotes writes are not instant, so locking may not even be possible)? Do you know the duplicacy 2-step process?

geek-merlin avatar Jan 06 '21 21:01 geek-merlin

I think GC right now will block everything. The backend is irrelevant. From the main logic's perspective, backends only write and load requested files (kind of).

However, GC can be stopped at any time without losing progress and then resumed, so if a long GC is a problem, I could imagine putting it behind something like timeout rdedup gc ... and running it periodically. It could probably be implemented with finer granularity, but I never investigated it.

I skimmed https://github.com/gilbertchen/duplicacy/wiki/Lock-Free-Deduplication#two-step-fossil-collection and rdedup is doing the exact opposite, mostly due to its idempotent design and support for concurrent writes synced through dumb syncing mechanisms such as Dropbox/Syncthing.

The GC works by creating another "generation" folder, then rewriting (moving) all the chunks to the new generation, stored name by stored name. After all names have been moved from the past generation to the new generation, the leftover data chunks in the previous generations are deleted (after some reasonably long time has passed, to make sure any concurrent writers had time for Dropbox/Syncthing to sync), as they are clearly not referenced by anything. This should be fine as long as the renames are not very expensive (which is not always the case, e.g. Backblaze B2 had no support for a rename operation at the time).
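
The generation-based GC described above can be modeled with a small in-memory sketch. This is an illustration under assumptions, not rdedup's implementation: `Generation`, `Gc`, and `ChunkId` are hypothetical names, chunks are plain integers, and the waiting period before deleting leftovers is left out.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical chunk identifier (a real one would be a content digest).
pub type ChunkId = u64;

/// Toy model of one generation folder: stored names and the chunks present.
#[derive(Default)]
pub struct Generation {
    /// Each stored name lists the chunks it references.
    pub names: BTreeMap<String, Vec<ChunkId>>,
    /// Chunks physically present in this generation.
    pub chunks: BTreeSet<ChunkId>,
}

/// GC state: names and their chunks migrate from `old` to `new`.
pub struct Gc {
    pub old: Generation,
    pub new: Generation,
}

impl Gc {
    /// Begin a GC run by creating an empty new generation.
    pub fn start(old: Generation) -> Self {
        Gc { old, new: Generation::default() }
    }

    /// Move one stored name and the chunks it references to the new
    /// generation. Returns false once no names are left, so the loop can be
    /// interrupted after any step and resumed later without losing progress.
    pub fn step(&mut self) -> bool {
        let (name, ids) = match self.old.names.pop_first() {
            Some(entry) => entry,
            None => return false,
        };
        for id in &ids {
            // A chunk shared with an already-moved name is no longer in
            // `old`; it already lives in `new`, so nothing needs to happen.
            if self.old.chunks.remove(id) {
                self.new.chunks.insert(*id);
            }
        }
        self.new.names.insert(name, ids);
        true
    }

    /// Once every name has moved, whatever is left in the old generation is
    /// unreferenced. Returns the new generation and the leftover chunks.
    pub fn finish(self) -> (Generation, BTreeSet<ChunkId>) {
        assert!(self.old.names.is_empty(), "names remain in the old generation");
        (self.new, self.old.chunks)
    }
}
```

In this model "moving" a chunk is a set removal plus insertion, which corresponds to a rename on a real backend; that is why the cost of renames matters for this design.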

dpc avatar Jan 06 '21 22:01 dpc

Ah amazing! Reading old tickets gives me a picture: #75 #32 #132 #37 Relevant to me: #172

geek-merlin avatar Jan 06 '21 23:01 geek-merlin