
Built-in authenticated encryption of archive

Open · sourcefrog opened this issue 5 years ago • 8 comments

Add encryption with user-managed keys.

This should encompass:

  • data blocks (to hide file contents)
  • indexes (to hide file names and other metadata)
  • block dir names (so an observer cannot tell whether a file with a known hash is stored)

https://github.com/sourcefrog/conserve/pull/191/files has some notes on a likely approach. However, I actually don't want to implement this until some earlier format work is done.
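
For concreteness, a minimal sketch of what authenticated encryption of a single data block could look like, assuming the chacha20poly1305 crate and a caller-supplied key; the nonce handling, key management, and block naming here are placeholders, not a settled format:

```rust
// Sketch only: assumes chacha20poly1305 = "0.10" and a caller-provided 256-bit key.
use chacha20poly1305::{
    aead::{Aead, AeadCore, KeyInit, OsRng},
    ChaCha20Poly1305, Key, Nonce,
};

/// Encrypt one data block. The ciphertext carries a Poly1305 tag, so tampering
/// is detected on decryption as well as the contents being hidden.
fn encrypt_block(key: &Key, plaintext: &[u8]) -> (Nonce, Vec<u8>) {
    let cipher = ChaCha20Poly1305::new(key);
    // Random 96-bit nonce per block; a real design must define how nonces are
    // chosen and stored next to the block.
    let nonce = ChaCha20Poly1305::generate_nonce(&mut OsRng);
    let ciphertext = cipher
        .encrypt(&nonce, plaintext)
        .expect("encryption should not fail");
    (nonce, ciphertext)
}

/// Returns None if the block was tampered with or the key is wrong.
fn decrypt_block(key: &Key, nonce: &Nonce, ciphertext: &[u8]) -> Option<Vec<u8>> {
    ChaCha20Poly1305::new(key).decrypt(nonce, ciphertext).ok()
}
```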

sourcefrog avatar Apr 28 '20 13:04 sourcefrog

It may also be worth adding the option of asymmetrically encrypted backups. This would allow the user to set up a daily backup job without having to store the private (i.e. secret) key on the machine.

WolverinDEV avatar Aug 03 '22 19:08 WolverinDEV

I'm actually a bit skeptical now about the value of doing this in Conserve rather than in the underlying storage layer: either a local filesystem or, for example, S3...

sourcefrog avatar Aug 04 '22 13:08 sourcefrog

It may also be worth adding the option of asymmetrically encrypted backups. This would allow the user to set up a daily backup job without having to store the private (i.e. secret) key on the machine.

Can you explain this a bit more? The backup job needs to both read existing content, and store (so sign and encrypt) new content. I don't immediately see how it could not have a private key.

sourcefrog avatar Aug 14 '22 15:08 sourcefrog

Some comments on the place of this on the roadmap:

I think this would be nice/interesting/useful/fun to add eventually, although as mentioned this can often be taken care of well by the storage layer.

Before adding encryption I would probably first want to shake out the inner format a bit more:

  • Do another revision of the base index format to make incremental backups more efficient. (https://github.com/sourcefrog/conserve/wiki/Next-format has some notes but is not final, and see also https://github.com/sourcefrog/conserve/labels/type%3Aformat-change.)
  • Merge SFTP support, add cloud object storage (#105), and assess performance with high-latency storage.

I think it is important that this starts with a design doc, not with code, since it's unfortunately common for encryption schemes to be ineffective in practice. This should talk about:

  • How keys are generated and stored, allowing for unattended backups while also helping users ensure they still have the key to restore if the machine is totally lost.
  • What the threat model is and what properties the encryption is trying to provide.
  • And then the actual encryption format.

Personally, in my spare time, I'm currently prioritizing getting https://github.com/sourcefrog/cargo-mutants to a state of reasonable completion, then I'll probably turn more attention back to Conserve in general.

sourcefrog avatar Aug 14 '22 15:08 sourcefrog

Hello @sourcefrog, I somehow overlooked this issue after my initial suggestion of asymmetric encryption.

I'm actually a bit skeptical now about the value of doing this [encryption] [...]

I agree that encryption can often be taken care of by the storage layer, but mostly this isn't trivial to set up. Setting up encryption at the storage layer might also not be possible at all; an example is cloud storage, where you don't have control over the actual storage medium. I hope the other advantage of handling encryption in Conserve itself will become obvious below.

The backup job needs to both read existing content, and store (so sign and encrypt) new content.

I agree that files must be encrypted during the backup process, but this can be achieved purely with the public key. Signing the corresponding data would be a good addition, but I think identity verification of the backup's creator is not the main concern.

But before going into implementation details, I think the reason why asymmetric encryption could be quite beneficial hasn't become clear yet. Creating backups is mostly an automated task scheduled every once in a while (hopefully regularly!). The encryption key would therefore have to be stored on the backup storage medium as well, rendering the encryption useless. Having to manually trigger the backup every time just to provide the backup key / password isn't a great option either. Note: storing the backup key on the machine being backed up isn't an option, since in case of failure you'd lose access to that key and therefore to the backup as well.

Using asymmetric encryption would therefore make it easy to create automated encrypted backups without needing to provide the private key. If you're a little more paranoid, you don't even have to store the private key anywhere digitally.
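
To illustrate the idea (a sketch under assumed choices, not a proposed format): the backup job could generate a random symmetric data key, encrypt the new blocks with it, wrap that key with the user's public key, and store only the wrapped copy in the archive. The sketch uses the rsa crate purely as a familiar example of public-key wrapping; a real design might well prefer an age-style X25519 scheme.

```rust
// Sketch only: assumes rsa = "0.9" and rand = "0.8"; key sizes and padding are
// placeholders, and error handling is elided.
use rsa::{Pkcs1v15Encrypt, RsaPrivateKey, RsaPublicKey};

fn main() {
    let mut rng = rand::thread_rng();

    // Generated once, offline; the private key never needs to be on the backup host.
    let private_key = RsaPrivateKey::new(&mut rng, 2048).expect("key generation failed");
    let public_key = RsaPublicKey::from(&private_key);

    // On the backup host: pick a random symmetric data key, encrypt the new
    // blocks and indexes with it (e.g. via an AEAD), then store only this
    // wrapped copy of the data key in the archive.
    let data_key: [u8; 32] = rand::random();
    let wrapped_key = public_key
        .encrypt(&mut rng, Pkcs1v15Encrypt, &data_key)
        .expect("wrapping failed");

    // Only at restore (or index-rewrite) time is the private key needed.
    let unwrapped = private_key
        .decrypt(Pkcs1v15Encrypt, &wrapped_key)
        .expect("unwrapping failed");
    assert_eq!(unwrapped, data_key);
}
```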

I hope the motivation has become a little clearer, so I'll skip ahead to a very brief implementation idea.

Currently, paths that have already been saved are stored in plain text along with their metadata. This information is later used to check whether a file is already contained in the archive or needs to be stored. The same applies to updating a stored file, which is based on the file's timestamp (if I'm not mistaken; file permissions are not (yet?) taken into account).

But to achieve this, you don't need to have the decrypted data. Simply adding a hash over all attributes required to detect new or updated files would do the trick. These hashes would be stored alongside the current index, which would then be encrypted. To check whether a file has changed, build that hash locally and check whether the latest band contains it.

Notably, whenever the set of attributes used to determine whether a file has changed is itself changed, all stored hashes must be recomputed (based on the encrypted data), which therefore requires the private key. But since updating an archive (mainly when moving to a newer Conserve version) is already a manual step, prompting the user for their private key there is acceptable.
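
A minimal sketch of such a change-detection token, assuming the sha2 crate; which attributes go into the hash (here path, mtime, and size, purely illustrative) is exactly the open design question:

```rust
// Sketch only: assumes sha2 = "0.10"; the attributes hashed here are illustrative.
use sha2::{Digest, Sha256};

/// Change-detection token: a hash over the path and the attributes used to
/// decide whether a file needs to be stored again, so the index can carry this
/// token instead of the plaintext path and metadata.
fn change_token(apath: &str, mtime_unix_nanos: i64, size: u64) -> [u8; 32] {
    let mut hasher = Sha256::new();
    hasher.update(apath.as_bytes());
    hasher.update(mtime_unix_nanos.to_le_bytes());
    hasher.update(size.to_le_bytes());
    hasher.finalize().into()
}

// During backup: compute the token for each local file and check whether the
// latest band already contains it; if it does, the file is unchanged.
```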

As a second note: Only backup operations can be done without providing the private key. All other archive operations like restoring, updating (as mentioned above) or deleting bands are not possible without reading and decrypting the contents of the backup.

~ Markus

WolverinDEV avatar Jan 04 '23 22:01 WolverinDEV

Hey there, happy new year.

I earlier put some thoughts on encryption in https://github.com/sourcefrog/conserve/pull/191 and perhaps you have some thoughts on that. This is symmetric in the sense that writers can read previous backups; perhaps we want to and can avoid that. It also outlines some goals.

I agree that encryption can often be taken care of by the storage layer, but mostly this isn't trivial to set up.

I see this a bit differently: all my machines on various OSes have good whole-filesystem encryption built into the OS and turned on for both built-in and USB disks. Cloud providers have good, easily configurable encryption (e.g. https://docs.aws.amazon.com/AmazonS3/latest/userguide/bucket-encryption.html). You can argue whether you want to trust the cloud provider to manage the key: I'd say often it is reasonable, but some people will not want to, especially if they are using the cloud only to host backups. It has some advantages, including arguably making it harder to lose the key: if you can recover access to the account you can get back to the key.

Anyhow all this is rather philosophical: I think it'd be nice to have encryption in Conserve but it's also often reasonable to encrypt at a different layer. We don't need to decide about how much to trust cloud providers in general.

Creating backups is mostly an automated task scheduled every once in a while (hopefully regularly!). The encryption key would therefore have to be stored on the backup storage medium as well, rendering the encryption useless. Having to manually trigger the backup every time just to provide the backup key / password isn't a great option either. Note: storing the backup key on the machine being backed up isn't an option, since in case of failure you'd lose access to that key and therefore to the backup as well.

I agree automated scheduled backups are important; that's how I use it. So any encryption keys need to be available without any per-backup user action. (It might be reasonable to, for example, have a key in an OS keyring that's unlocked by the user's passphrase when they log in, or on an encrypted volume.) But the basic case is that the encryption key is accessible on the machines making the backup.

I don't think I follow the argument why this makes encryption useless. I agree it would be useless to have the encryption key stored in clear text alongside the backup.

I think you could be talking about either of two cases:

  1. The encryption key is stored in, say, the home directory that's backed up, and therefore the backup will include an encrypted copy of the key. But, if the encryption scheme is robust, this is harmless, as it's only accessible to someone who already possesses the key?
  2. The user has to store the encryption key somewhere they will be sure not to lose it if the source machine is lost. This is, in general, an issue with encryption keys for backups, whether symmetric or asymmetric? But the user could, for example, store a copy of the key encrypted by a passphrase, in cloud storage (a sketch of this follows below).
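
Such passphrase-wrapping of the archive key could look like the following, assuming the pbkdf2, sha2, and chacha20poly1305 crates; the salt, nonce, and iteration-count handling here are illustrative:

```rust
// Sketch only: assumes pbkdf2 = "0.12" (with its default "hmac" feature),
// sha2 = "0.10", and chacha20poly1305 = "0.10".
use chacha20poly1305::{
    aead::{Aead, AeadCore, KeyInit, OsRng},
    ChaCha20Poly1305, Nonce,
};
use pbkdf2::pbkdf2_hmac;
use sha2::Sha256;

/// Wrap the archive key under a passphrase so a copy can be parked in cloud
/// storage; returns (nonce, wrapped key). The salt must be stored alongside.
fn wrap_key_with_passphrase(
    passphrase: &str,
    salt: &[u8],
    archive_key: &[u8],
) -> (Nonce, Vec<u8>) {
    // Derive a 256-bit wrapping key from the passphrase.
    let mut wrapping_key = [0u8; 32];
    pbkdf2_hmac::<Sha256>(passphrase.as_bytes(), salt, 600_000, &mut wrapping_key);

    // Encrypt the archive key under the derived key; the AEAD tag means a
    // wrong passphrase is detected rather than silently yielding garbage.
    let cipher = ChaCha20Poly1305::new_from_slice(&wrapping_key).expect("32-byte key");
    let nonce = ChaCha20Poly1305::generate_nonce(&mut OsRng);
    let wrapped = cipher
        .encrypt(&nonce, archive_key)
        .expect("encryption should not fail");
    (nonce, wrapped)
}
```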

Moving to the implementation idea:

The way you describe it tends to sound like we have an unencrypted list of (filename, hash) pairs that the backup writer can consult to see if a particular file has changed. That would seem to make the filenames visible to anyone who can read the archive, which seems bad.

To avoid that but still let writers write incremental backups without reading previous content, we could hypothetically just expose the timestamp of the previous backup, and write everything with an mtime after that date. However, then there's the question of how to hash the data. If we use an unkeyed hash, or a hash keyed with a key known to the writer, it's easy to see what data is stored. Alternatively, perhaps the writer chooses a new per-backup key, but then we'll never match common blocks across backups, so a lot more blocks will be written and a lot more space will be used.
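
To make that trade-off concrete, a keyed block address could look like the sketch below (assuming the hmac and sha2 crates): with a stable archive-wide key, deduplication across backups keeps working, but anyone holding that key can test whether a known block is present; a fresh per-backup key closes that test and also defeats cross-backup deduplication.

```rust
// Sketch only: assumes hmac = "0.12" and sha2 = "0.10".
use hmac::{Hmac, Mac};
use sha2::Sha256;

type HmacSha256 = Hmac<Sha256>;

/// Keyed block address: the same (key, block) pair always yields the same name,
/// so identical blocks deduplicate for as long as the key stays the same.
fn block_address(hash_key: &[u8], block: &[u8]) -> Vec<u8> {
    let mut mac = HmacSha256::new_from_slice(hash_key).expect("HMAC accepts any key length");
    mac.update(block);
    mac.finalize().into_bytes().to_vec()
}
```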

I'm also just not sure yet that it's important that writers can't read old content.

sourcefrog avatar Jan 05 '23 04:01 sourcefrog

Happy new year to you as well.

We don't need to decide about how much to trust cloud providers in general.

Yes, you nailed it. The basic discussion boils down to this. Imo, when it comes to security, the first rule is: never trust anybody. Since we already agreed that adding encryption to Conserve is a good step, this doesn't need to be discussed any further :).

the backup will include an encrypted copy of the key

I totally agree, this isn't the issue either. The main point I was worried about is the second case. Tbh I somehow totally neglected the idea of creating a copy of the key and storing it elsewhere. At first glance this actually does the trick and resolves my concerns.

Regarding the implementation idea: instead of storing the file name, just store a hash over the file name and its attributes. Yes, this introduces the possibility of collisions, but with SHA256, for example, collisions are incredibly unlikely.

Edit: I saw #191 already but only took a quick glance at it and didn't go through it in detail.

WolverinDEV avatar Jan 05 '23 10:01 WolverinDEV

Happy new year to you as well.

We don't need to decide about how much to trust cloud providers in general.

Yes, you nailed it. The basic discussion boils down to this. Imo, when it comes to security, the first rule is: never trust anybody.

Well, this also gets very philosophical. 😁 I'd say more important maxims are "be clear what you're trying to achieve and what your threat model is", and "keep it simple."

"Never trust anyone" is not really practical as such. "Be clear who you're trusting and for what", which is part of a threat model, is more helpful in my experience.

Since we already agreed that adding encryption to Conserve is a good step, this doesn't need to be discussed any further :).

the backup will include an encrypted copy of the key

I totally agree, this isn't the issue either. The main point I was worried about is the second case. Tbh I somehow totally neglected the idea of creating a copy of the key and storing it elsewhere. At first glance this actually does the trick and resolves my concerns.

👍🏻

Regarding the implementation idea: instead of storing the file name, just store a hash over the file name and its attributes. Yes, this introduces the possibility of collisions, but with SHA256, for example, collisions are incredibly unlikely.

So this is a great example of "first, be clear on the goals and threat model."

If the archive has a list of SHA256(apath, mtime) then it's practical for someone to determine whether a file with a particular path is present by guessing at possible mtimes, many of which will be quantized to seconds or milliseconds and so have only millions of possible values.

Determining whether a file with a known name is present is not the worst attack, but it's not nothing, so I'd rather not accept it unless there's some compelling reason.
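
To make the guessing attack concrete, a sketch assuming the sha2 crate and a hypothetical token of SHA256(apath, mtime) with mtimes quantized to whole seconds:

```rust
// Sketch only: assumes sha2 = "0.10" and that the archive stores an unkeyed
// SHA256 over (apath, mtime as little-endian seconds).
use sha2::{Digest, Sha256};
use std::collections::HashSet;

/// Check whether a file with a known path appears in the archive's token list
/// by trying every plausible mtime in a date range.
fn file_probably_present(
    stored_tokens: &HashSet<[u8; 32]>,
    known_apath: &str,
    earliest_mtime: i64,
    latest_mtime: i64,
) -> bool {
    (earliest_mtime..=latest_mtime).any(|mtime| {
        let mut hasher = Sha256::new();
        hasher.update(known_apath.as_bytes());
        hasher.update(mtime.to_le_bytes());
        let token: [u8; 32] = hasher.finalize().into();
        stored_tokens.contains(&token)
    })
}

// A year of one-second timestamps is only about 31.5 million hashes, well within
// reach of a laptop; that is why an unkeyed hash of (apath, mtime) leaks whether
// a known file is stored.
```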

Edit: I saw #191 already but only took a quick glance at it and didn't go through it in detail.

sourcefrog avatar Jan 05 '23 19:01 sourcefrog