Privacy for multiple users backing up to shared repo
Feature request / topic for discussion.
I would like to be able to back up multiple computers/users to a single shared encrypted target in order to maximize de-duplication. I would prefer, however, that individual users only have access to their own backup set for restoration purposes. As I understand it, this is currently not possible with the architecture of duplicacy. The idea here is not to protect against a malicious or determined attacker, since ideally you wouldn't let one back up to your repo anyway. Rather, this is meant to keep prying eyes, even those willing to drop into the CLI, from snooping on files that are not theirs to see. Ideally I would like to have several generations (parents, grandparents, tech-savvy teen children, and possibly close family and friends) back up to the same shared de-duplicated location with some amount of privacy.
The problem is, as you stated, that you want a single repository for its de-duplication benefits, but that also means sharing the same encryption keys. Using different encryption keys would result in a different hash for the same data, thereby eliminating the benefit of de-duplication.
Perhaps @gilbertchen might have other thoughts, but as I understand the architecture, this simply wouldn't be possible.
I have been mulling over what might be a way to make this happen in duplicacy. Currently duplicacy uses a handful of different keys to encrypt the various types of data it stores as part of a backup. In order to back up, restore, and prune the storage, duplicacy needs all of these keys to effectively de-duplicate chunks and to ensure that only chunks not referenced by any existing snapshot are fossilized and then deleted.

My idea would be to add another type of key to the storage alongside the existing keys. The new key would be known only to the user that created a backup with it. The backup process would create and encrypt file chunks and hashes just as it does now. Snapshot files, however, would change somewhat: rather than a single file per snapshot, there would be two snapshot files. The first would be the equivalent of the existing snapshot file. It would list all the files in the backup and the chunks that make them up, but rather than being encrypted with the existing global backup key, it would be encrypted with the user's private backup key. A second snapshot file would then be created and encrypted with the existing shared key. This file would list two things: the private snapshot file, and a virtual file containing all of the chunks referenced by the snapshot in a random or pseudo-random-per-user order. The idea is that operations on the storage could use this public snapshot file to determine which chunks are referenced by the backup, while making it difficult (though not impossible) to reassemble the chunks into meaningful files. A minimal sketch of this layout follows below.
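A minimal sketch of that two-snapshot-file layout, assuming hypothetical names and a generic encrypt() helper (none of this is existing duplicacy code):

```python
import json
import random

def build_snapshot_files(file_index, chunk_refs, user_key, shared_key, user_id, encrypt):
    """Hypothetical sketch of the two-snapshot-file idea.

    file_index -- mapping of each backed-up file to the chunk hashes composing it
    chunk_refs -- every chunk hash referenced by this snapshot
    encrypt    -- stand-in for whatever authenticated encryption is in use
    """
    # Private snapshot: the real file-to-chunk mapping, encrypted with the
    # key known only to the user who created the backup.
    private_snapshot = encrypt(json.dumps(file_index).encode(), user_key)

    # Public snapshot: just the set of referenced chunks, in a pseudo-random
    # per-user order, encrypted with the existing shared key.  Check/prune can
    # count references from it without learning how the chunks fit together.
    ordering = sorted(chunk_refs)
    random.Random(user_id).shuffle(ordering)  # deterministic per user
    public_snapshot = encrypt(json.dumps({
        "private_snapshot": "<path of the private snapshot file>",
        "referenced_chunks": ordering,
    }).encode(), shared_key)

    return private_snapshot, public_snapshot
```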
When I hear "global backup key" and "difficult, but not impossible to reassemble" I start to worry about how secure this can be made. If any data could become shared at any time it means any user of the shared space can access the data at any time.
I suppose you could base the data encryption key solely on the data itself and have each user encrypt that key with their private key. The data would need to be stored under a clear name, which exposes content information. I'm also not sure how secure it is to encrypt with a key based on the message content; it doesn't seem sound at first glance.
Then there's still the issue that no matter what you do, a user can identify data that they share with someone else just by noticing what is deduplicated.
As I mentioned in the original request, I am not looking for a high-security option here; I understand that such a request is likely unfeasible. I am not looking for a way to let multiple nefarious users safely back up to the same repo, but rather a way to discourage trusted users from poking around in, or inadvertently stumbling through, snapshots which they did not themselves create. The method I outlined above would still rely on the existing encryption to prevent non-users from accessing data. When I spoke of "global backup keys" I was referring to what we have now, that is, a single set of keys used to encrypt the backup for all users sharing a storage.
Ah, true, if you're not looking for strong security this does get easier. Though, I wonder if it would be enough for you to just use the same key but different repo names. You did mention tech-savvy children, so I think you do need to plan for malicious attackers ;-). From a development-effort standpoint, I'm not sure how much time it would be worth putting in for something that isn't really providing any security.
The strong way to do this is with a trusted service that handles the file encryption and generates the keys for each block. That service can control access to the derived keys via the user keys. But that requires putting the encryption on the server or on some other computer that is accessible to all users.
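A toy sketch of that idea (all names here are hypothetical): the service alone holds the master key, derives per-block keys, and releases them only after an access check.

```python
import hashlib
import hmac

class KeyService:
    """Toy sketch: only this service holds the master key; it derives
    per-block keys and releases them only after an access-control check."""

    def __init__(self, master_key: bytes, acl: dict):
        self._master_key = master_key
        self._acl = acl  # user_id -> set of repo names that user may read

    def block_key(self, user_id: str, repo: str, block_hash: bytes) -> bytes:
        if repo not in self._acl.get(user_id, set()):
            raise PermissionError("user may not access this repo")
        # The same block hash yields the same key for every authorized
        # user, so cross-user de-duplication still works.
        return hmac.new(self._master_key, block_hash, hashlib.sha256).digest()
```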
Maybe you could share a master key, derive per-chunk keys from the chunk hash, have each user encrypt the chunk key with their user key, and name the encrypted chunk by a different hash function. Then:
- A user can't read data they don't already have access to
- A user will know if another user has their same data
- I'm not sure how secure this is
# never stored anywhere
chunk_hash = hash(chunk)
chunk_key = derive(shared_key, chunk_hash)
# uploaded
chunk_name = hash(chunk_hash)
encrypted_chunk = encrypt(chunk, chunk_key)
encrypted_chunk_key = encrypt(chunk_key, user_key)
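Fleshed out as a runnable Python sketch, with HMAC-SHA256 standing in for derive() and AES-GCM (from the third-party cryptography package) standing in for encrypt(); none of this is Duplicacy's actual API:

```python
import hashlib
import hmac
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM  # pip install cryptography

def backup_chunk(chunk: bytes, shared_key: bytes, user_key: bytes):
    # never stored anywhere
    chunk_hash = hashlib.sha256(chunk).digest()
    chunk_key = hmac.new(shared_key, chunk_hash, hashlib.sha256).digest()  # derive()

    # uploaded
    chunk_name = hashlib.sha256(chunk_hash).hexdigest()
    # Deterministic nonce: safe because chunk_key is unique to this chunk,
    # and necessary so identical chunks encrypt identically and de-duplicate.
    encrypted_chunk = AESGCM(chunk_key).encrypt(chunk_hash[:12], chunk, None)
    # Random nonce here: user_key is reused to wrap many chunk keys.
    nonce = os.urandom(12)
    encrypted_chunk_key = nonce + AESGCM(user_key).encrypt(nonce, chunk_key, None)
    return chunk_name, encrypted_chunk, encrypted_chunk_key
```

Note that the chunk itself has to be encrypted deterministically, otherwise identical chunks from different users would upload as different objects and de-duplication would be lost; the wrapped per-user key can use a random nonce since it is never de-duplicated.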
I think I may have misused duplicacy's terminology in my comments above. I realize now that when I was saying "repository" I was really referring to what duplicacy calls a "storage". I am not sure I understand this comment:
Though, I wonder if it would be enough for you to just use the same key, but different repo names.
My understanding, and the reason for making this request, was that multiple named "repos" backing up to the same "storage" would all share the same encryption for de-duplication purposes. In that case any user could list and restore snapshots from any "repo" within the "storage" using the CLI with little effort.
The idea I presented above was supposed to prevent users of the duplicacy software, command line or GUI, from being able to restore "repos" that they did not create, and to make it inconvenient to try to piece together a snapshot outside of duplicacy. This was further based on the idea that, given a large number of chunk files and their hashes but no snapshot relating chunk hashes to files, it would be difficult to re-assemble them into their original order.
Yeah I get that and I think it's probably a good idea to keep the snapshot file encrypted. And I understand the issue you're trying to overcome.
What I'm saying though is that "difficult to re-assemble" isn't good enough. I can envision an algorithm that takes all the available chunks, decrypts them, and then iteratively looks for chunks that fit together. When a match is found the search space decreases. I don't think it'd be as difficult as it sounds. This may be even easier with multiple snapshots, as the chunk stream becomes shifted and it takes a while before boundaries line up with already-uploaded chunks; there would be overlapping chunks to aid reconstruction. Further, duplicacy chunks are likely to be around 4MB in size; that's a decent amount of information to extract without even bothering to piece chunks together.
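For illustration, here's a toy version of that search, assuming the chunks have already been decrypted with the shared key (the names are mine, not anything from duplicacy):

```python
def candidate_neighbors(chunks, min_overlap=64):
    """Toy sketch of the reassembly idea: link pairs of decrypted chunks
    where a suffix of one equals a prefix of another, as happens when
    shifted chunk boundaries re-cut the same underlying data."""
    links = []
    for i, a in enumerate(chunks):
        for j, b in enumerate(chunks):
            if i == j:
                continue
            limit = min(len(a), len(b)) - 1
            for n in range(limit, min_overlap - 1, -1):  # longest overlap first
                if a[-n:] == b[:n]:
                    links.append((i, j, n))  # `a` likely precedes/overlaps `b`
                    break
    return links
```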
There are certainly ways to achieve what you're looking for, but I don't think it's worth pursuing if they compromise security, even if that's something you're personally willing to accept. But I'm not the owner of the project, and there may even be secure ways to accomplish this.
but I don't think it's worth pursuing if they compromise security

Agreed, and I certainly appreciate your insights on this matter. My thinking was that the method I proposed above would not compromise security, though I admit my understanding of encryption is far from expert level. My thought was that from the perspective of an outside attacker, all the chunks, files, and hashes would still be encrypted and inaccessible (both with keys shared by all users of the storage and with a key unique to each individual user). If the addition of a per-user repo key somehow compromises the integrity of the overall storage encryption scheme, then my idea is for sure a no-go.
Yes, from the perspective of an outside attacker everything is still secure. From the perspective of each user though everyone can access the data of other users as the chunks themselves are encrypted with a key that everyone has access to. The per-user key in your scheme protects the snapshot, but not the chunk data. The scheme I've laid out may protect chunk data, but I'm not sure of how secure it is.
If you're ok with making it inconvenient for users to see the data of other users, your scheme is fine. I'm not convinced it would be that inconvenient though.
I think per-user encryption is not only possible but also very easy to implement. Duplicacy already implements convergent encryption for chunks; that is, chunks are encrypted using their own hashes. Snapshot files are encrypted differently, using the same fileKey in the config file. Therefore, we just need to introduce a per-repository fileKey to prevent a user from reading others' snapshots.
This scheme isn't susceptible to the 'brute-force assembly' method, because you can't decrypt a chunk unless you know the hash of the chunk, and to know the hash of the chunk you need to decrypt the snapshot file first.
The only problem left is that without a global fileKey, none of the list, check, and prune operations would work. I believe this problem can be solved by the feature requested in #330: when Duplicacy uploads the encrypted snapshot file, it should also upload another file containing the file names of all chunks that the snapshot references. Then Duplicacy can run the check and prune commands without needing to decrypt all snapshot files first.
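A minimal sketch of what that pair of uploads could look like; the storage API, file layout, and helper names here are hypothetical, since #330 doesn't fix a format:

```python
import json

def upload_snapshot(storage, repo_id, revision, snapshot, chunk_names,
                    repo_file_key, encrypt):
    """Hypothetical sketch of the #330-style layout; nothing here is
    Duplicacy's real code or file format."""
    path = f"snapshots/{repo_id}/{revision}"
    # Snapshot body: encrypted with the per-repository fileKey, so other
    # users of the same storage cannot read this repo's file lists.
    storage.put(path, encrypt(json.dumps(snapshot).encode(), repo_file_key))
    # Companion chunk list: just the referenced chunk file names, so check
    # and prune can count references without any fileKey.  (Whether this
    # list should itself be encrypted with a shared key is left open here.)
    storage.put(path + ".chunks", "\n".join(chunk_names).encode())
```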
So there's not even a need for a shared key to derive the chunk-key from? That's cool. Hmm, doesn't that mean that anyone who gets access to a list of chunk names can determine their content by brute force? Say you wanted to prove that someone had a specific file. You could chunk and hash the entire file multiple times with offsets and look for those chunk names in the list. I thought the chunk names were encrypted as well, but that would mean multi-user de-duplication wouldn't work unless the encryption key was shared. Is that correct?
Anyway, the solution of #330 would be easy to extend to provide that per-snapshot chunk reference file for older snapshots as well. And un-encrypted snapshot files could be encrypted and re-uploaded too. This sounds like it's something that wouldn't break existing backups and could be turned on and off at any time, incurring only the cost of re-uploading the snapshot files; chunks would stay in place.
So there's not even a need for a shared key to derive the chunk-key from?
There is a shared key, hashKey, used to derive the HMAC hash of the chunk. This HMAC hash is then used as the encryption key for the chunk.
doesn't that mean that anyone who gets access to a list of chunk names can determine their content by brute force
The chunk name is derived from the chunk hash and idKey. Without knowing idKey you won't be able to generate chunk names, so only legitimate users who can access the config file will be able to confirm the existence of a file. I don't think it is logically possible to prevent these users from performing the "confirmation of a file" attack when cross-user deduplication is allowed.
Using @fracai's notation, this is how the shared keys (hashKey, idKey, and fileKey) are used:
# for chunks
chunk_hash = hash(chunk, hashKey)
chunk_key = chunk_hash
chunk_name = derive(chunk_hash, idKey)
# for snapshots
snapshot_key = derive(snapshot_file_path, fileKey)
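The same notation as a runnable sketch, with HMAC-SHA256 standing in for both hash() and derive() (an assumption; the actual primitives may differ, and the function names are illustrative, not Duplicacy's):

```python
import hashlib
import hmac

def hkey(key: bytes, data: bytes) -> bytes:
    # HMAC-SHA256 standing in for both hash() and derive() above
    return hmac.new(key, data, hashlib.sha256).digest()

# for chunks
def chunk_ids(chunk: bytes, hashKey: bytes, idKey: bytes):
    chunk_hash = hkey(hashKey, chunk)   # requires hashKey from the config file
    chunk_key = chunk_hash              # convergent: the hash is the key
    chunk_name = hkey(idKey, chunk_hash).hex()
    return chunk_key, chunk_name

# for snapshots
def snapshot_key(snapshot_file_path: str, fileKey: bytes) -> bytes:
    return hkey(fileKey, snapshot_file_path.encode())
```

This also makes the "confirmation of a file" attack concrete: anyone holding hashKey and idKey can chunk a suspected file, run the chunks through chunk_ids, and look for the resulting names in the storage listing.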
Got it. hashKey is what I was missing. Thanks.
And yeah, confirmation of a file is unavoidable unless you're going to pretend to upload data with every chunk, but that wastes bandwidth and time. File confirmation is a reasonable tradeoff here. If a user or admin is concerned about that, they can either encrypt on their own first or use non-shared space.
What is the status of this issue?
I'm not a fan of bumps, but I hope one every 2-3 years doesn't upset too many people :)
@gilbertchen is this idea/request tracked on some priority list? If so, how high or low a priority does it have?
Thank you!