Feature Request: Reverse mode without deterministic initialization vectors

Open slackner opened this issue 6 years ago • 14 comments

First of all, thank you very much for this awesome project. Especially the reverse-mode is very interesting for my specific use-case. Unfortunately, at least in the current implementation, I would consider the cryptography of the reverse-mode a bit weaker than the regular mode. The main reason is that initialization vectors are deterministic for blocks.

In contrast to regular mode, modifying a block of a file does not lead to the generation of a new IV. When a file is changed frequently, an attacker might be able to collect many samples of ciphertext, all encrypted with the same key and IV. Even without any further knowledge, this allows an attacker to detect whenever a file is reverted to a previous state. In a more sophisticated attack, an attacker could also try to identify files in a directory based on file size patterns. Let's assume the directory contains a cloned open-source repository: an attacker might then be able to collect multiple plaintext <-> ciphertext samples, all encrypted with the same key and IV.
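The revert-detection concern can be illustrated with a minimal sketch. It is not gocryptfs code: the `deriveIV` and `encryptBlock` helpers are hypothetical, and AES-GCM from the Go standard library stands in for the AES-SIV used by reverse mode. The only point is that a fixed key plus an IV that depends only on path and block number maps equal plaintext to equal ciphertext.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// deriveIV returns a deterministic 12-byte nonce from the file path and the
// block number. Hypothetical derivation, for illustration only.
func deriveIV(path string, blockNo uint64) []byte {
	h := sha256.New()
	h.Write([]byte(path))
	h.Write([]byte{0})
	var n [8]byte
	binary.BigEndian.PutUint64(n[:], blockNo)
	h.Write(n[:])
	return h.Sum(nil)[:12]
}

// encryptBlock encrypts one block under a deterministic IV.
// AES-GCM is a stand-in here; reverse mode itself uses AES-SIV.
func encryptBlock(key []byte, path string, blockNo uint64, plaintext []byte) []byte {
	block, err := aes.NewCipher(key)
	if err != nil {
		panic(err)
	}
	aead, err := cipher.NewGCM(block)
	if err != nil {
		panic(err)
	}
	return aead.Seal(nil, deriveIV(path, blockNo), plaintext, nil)
}

func main() {
	key := make([]byte, 32) // all-zero demo key, never use in practice
	a := encryptBlock(key, "notes.txt", 0, []byte("state A"))
	b := encryptBlock(key, "notes.txt", 0, []byte("state B"))
	reverted := encryptBlock(key, "notes.txt", 0, []byte("state A"))

	fmt.Println("A vs B identical:", string(a) == string(b))
	fmt.Println("A vs reverted A identical:", string(a) == string(reverted))
}
```

An observer who stores old ciphertext blocks only needs a byte comparison to notice that the block returned to an earlier state.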

I understand that it was implemented this way to ensure that, for example when doing repeated backups, only the changed files are transmitted. Nevertheless, there is also another possibility which would solve this problem without falling back to deterministic IVs. The general idea would be to store IVs (and maybe also checksums to detect modified blocks) separately from the original data, either at a separate location or by using extended attributes. Reverse mode could then use these stored IVs (or create them if they don't exist yet). Even if the stored IVs are lost this would not really be a critical issue. In the worst case, new IVs are generated and files have to be transmitted again.
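A minimal sketch of the proposed IV store, under stated assumptions: the `ivStore` type, the in-memory map, and the use of CRC32 as the change-detecting checksum are all hypothetical choices made for illustration and are not taken from gocryptfs. The stored IV is reused while the block is unchanged, and a fresh random IV is generated otherwise.

```go
package main

import (
	"crypto/rand"
	"fmt"
	"hash/crc32"
)

const ivLen = 16

// blockKey identifies one block of one file.
type blockKey struct {
	path    string
	blockNo uint64
}

// entry holds the stored IV and a checksum of the plaintext it was used for.
type entry struct {
	iv       [ivLen]byte
	checksum uint32
}

// ivStore is a hypothetical in-memory IV database.
type ivStore map[blockKey]entry

// blockIV returns the stored IV if the block is unchanged. Otherwise it
// generates, stores, and returns a fresh random IV. The second return value
// reports whether a new IV was generated.
func (s ivStore) blockIV(path string, blockNo uint64, plaintext []byte) ([ivLen]byte, bool) {
	k := blockKey{path, blockNo}
	sum := crc32.ChecksumIEEE(plaintext)
	if e, ok := s[k]; ok && e.checksum == sum {
		return e.iv, false
	}
	var e entry
	if _, err := rand.Read(e.iv[:]); err != nil {
		panic(err)
	}
	e.checksum = sum
	s[k] = e
	return e.iv, true
}

func main() {
	s := ivStore{}
	iv1, fresh1 := s.blockIV("vm.img", 7, []byte("old contents"))
	iv2, fresh2 := s.blockIV("vm.img", 7, []byte("old contents"))
	iv3, fresh3 := s.blockIV("vm.img", 7, []byte("new contents"))
	fmt.Println(fresh1, fresh2, fresh3)
	fmt.Println("unchanged block keeps IV:", iv1 == iv2)
	fmt.Println("changed block gets new IV:", iv2 != iv3)
}
```

If the store is lost, every lookup misses and every block simply receives a new IV, which matches the worst case described above: the files are transmitted again.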

What do you think about this idea, and is it feasible to implement something like that into gocryptfs?

Best regards, Sebastian

slackner avatar Nov 15 '17 15:11 slackner

Hi Sebastian, thanks for the kind words!

With the "iv database", we would have to store checksums of all blocks of all files. This seems to be pretty heavyweight to me.

What about mixing the file timestamp into the IV generation, activated via a command-line switch?

The other thing you described is what I called "file size fingerprinting" in the threat model. With the 1:1 file to encrypted file model that gocryptfs uses, the file sizes will always be "public".

rfjakob avatar Nov 15 '17 15:11 rfjakob

On 15.11.2017 16:56, rfjakob wrote:

With the "iv database", we would have to store checksums of all blocks of all files. This seems to be pretty heavyweight to me.

Not sure if it is such a big problem. The storage costs with 16 byte IV + 4 byte checksum are still smaller than for the regular mode, where we have to store 16 byte IV + 16 byte GHASH. At least for me that would be perfectly fine, especially when taking into account that the checksums do not have to be transmitted during the backup process anyway.
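To put numbers on this comparison, here is a small sketch. It assumes the gocryptfs plaintext block size of 4096 bytes and ignores any per-file header; `perBlockOverhead` is a hypothetical helper name.

```go
package main

import "fmt"

// blockSize is the plaintext bytes per block (assumed: gocryptfs default).
const blockSize = 4096

// perBlockOverhead returns the total metadata size for a file when
// bytesPerBlock bytes are stored for every (possibly partial) block.
func perBlockOverhead(fileSize, bytesPerBlock int64) int64 {
	blocks := (fileSize + blockSize - 1) / blockSize
	return blocks * bytesPerBlock
}

func main() {
	const gib = int64(1) << 30
	fmt.Println("IV + checksum:", perBlockOverhead(gib, 16+4), "bytes")
	fmt.Println("IV + GHASH:   ", perBlockOverhead(gib, 16+16), "bytes")
}
```

For a 1 GiB file this gives 5 MiB for the IV database versus 8 MiB of per-block overhead in regular mode, i.e. roughly 0.5% versus 0.8% of the file size.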

What about mixing the file timestamp into the IV generation, activated via a command-line switch?

That would also be an option, but it would no longer allow partial updates of a file. For big files like VM images this is still very useful.
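The loss of partial updates can be seen in a short sketch. The `deriveIV` function below is a hypothetical derivation that hashes path, modification time, and block number; it is not how gocryptfs derives IVs. Because the timestamp is an input for every block, a single write that bumps the timestamp changes the IV of every block, including untouched ones.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// deriveIV mixes the file modification time into a per-block IV.
// Hypothetical derivation, for illustration only.
func deriveIV(path string, mtimeNs int64, blockNo uint64) [16]byte {
	h := sha256.New()
	h.Write([]byte(path))
	h.Write([]byte{0})
	var buf [16]byte
	binary.BigEndian.PutUint64(buf[:8], uint64(mtimeNs))
	binary.BigEndian.PutUint64(buf[8:], blockNo)
	h.Write(buf[:])
	var iv [16]byte
	copy(iv[:], h.Sum(nil))
	return iv
}

// changedIVs counts how many of the first `blocks` blocks get a different IV
// when the modification time goes from oldMtime to newMtime.
func changedIVs(path string, oldMtime, newMtime int64, blocks uint64) int {
	n := 0
	for b := uint64(0); b < blocks; b++ {
		if deriveIV(path, oldMtime, b) != deriveIV(path, newMtime, b) {
			n++
		}
	}
	return n
}

func main() {
	// A 1000-block file where one write bumps the mtime by one nanosecond.
	fmt.Println("blocks with new IV:", changedIVs("vm.img", 1510761600000000000, 1510761600000000001, 1000))
}
```

Every block of the encrypted view changes, so a tool such as rsync has to transfer the whole VM image again even if only one block was written.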

The other thing you described is what I called "file size fingerprinting" in the threat model. With the 1:1 file to encrypted file model that gocryptfs uses, the file sizes will always be "public".

I am aware that this problem also exists in general, but only in reverse mode is it possible to exploit this method to get multiple known plaintext <-> ciphertext pairs with the same key and IV.

slackner avatar Nov 15 '17 16:11 slackner

Storage space is one thing, but at the moment reverse mode does not need any storage space. The bigger issue is the code complexity this database adds. Also, getting this to run fast will be difficult, and a read-only workload would turn into read-write due to this. Storing the info in extended attributes solves the question of how to purge obsolete entries from the database, but, as far as I know, extended attributes do not support partial updates.

rfjakob avatar Nov 15 '17 17:11 rfjakob

I agree that it is difficult to get this implemented correctly and to get it fast. Nevertheless, in my specific use-case neither performance nor storage space is the critical factor. I'm planning to use it for remote backups, and rsync will skip files with same modified date anyway.

I've decided to drop the idea with extended attributes because of the size limitations. Apparently a lot of file systems limit the size of each attribute to 4KB, which makes it much more difficult to use them for this purpose.

A prototype for my idea (not in a mergeable state!) is available at: https://github.com/fds-team/gocryptfs/commits/nondet-reverse

Current limitations:

  • Performance is relatively slow (maybe it can be made faster by allowing to use GCM again?)
  • No pruning of obsolete entries implemented yet
  • "ivChanged" variable will need some locking

slackner avatar Nov 18 '17 09:11 slackner

Oh, you have a prototype already, nice! While I was not very motivated to implement this myself, I'll not decline a pull request for this feature.

Comment on https://github.com/fds-team/gocryptfs/blob/f25bb022028f90fa8e8c9154164e155178b2f929/internal/pathiv/pathiv.go#L103 : You could pass the ciphertext instead and look at the message authentication code. This way there is no extra hash needed if the block has not changed.

rfjakob avatar Nov 18 '17 16:11 rfjakob

If I read https://tools.ietf.org/html/rfc5297#section-2.6 correctly, the MAC is in the first 16 bytes of the ciphertext in AES-SIV.

rfjakob avatar Nov 18 '17 16:11 rfjakob

On 18.11.2017 17:17, rfjakob wrote:

Oh, you have a prototype already, nice! While I was not very motivated to implement this myself, I'll not decline a pull request for this feature.

Awesome :) I'll probably start with pull request(s) for the cleanup commits. The feature itself needs some more work, especially the code to save/load the IVs to/from the disk is still very hacky. Currently I just use the gob package, do you have a better suggestion which database or storage method could be used? The problem with gob is that the full file needs to be written each time. In practice, it would be better to allow partial updates.

Comment on https://github.com/fds-team/gocryptfs/blob/f25bb022028f90fa8e8c9154164e155178b2f929/internal/pathiv/pathiv.go#L103 : You could pass the ciphertext instead and look at the message authentication code. This way there is no extra hash needed if the block has not changed.

I'm not sure if I understand the idea. The ciphertext isn't available yet when the BlockIV function is called. Do you mean that I should first try to use the old IV, and then generate a new one if the authentication code changes? I agree that this would be more secure (no need to store checksums of the plaintext), but has the disadvantage that some blocks have to be encrypted twice. Or is there also a way to compute just the authentication code without encrypting the block itself?

slackner avatar Nov 18 '17 18:11 slackner

I'm not sure how to handle partial updates best, but I would not want to pull in something huge like sqlite. Maybe it's good enough to rewrite the file on exit (and maybe periodically).
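A minimal sketch of the "rewrite the whole file" approach with the gob package. The `ivDB` and `entry` types, the string key format, and the file name are hypothetical; the write goes to a temporary file first and is then renamed over the old database, so a crash during the rewrite does not leave a truncated file.

```go
package main

import (
	"encoding/gob"
	"fmt"
	"os"
	"path/filepath"
)

// entry is one stored record. Fields are exported so gob can encode them.
type entry struct {
	IV  [16]byte
	MAC [16]byte
}

// ivDB maps a hypothetical "path#blockNo" key to its stored entry.
type ivDB map[string]entry

// save rewrites the complete database via a temporary file and a rename.
func save(path string, db ivDB) error {
	tmp, err := os.CreateTemp(filepath.Dir(path), "ivdb-*.tmp")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // fails harmlessly once the rename succeeded
	if err := gob.NewEncoder(tmp).Encode(db); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path)
}

// load reads the complete database back into memory.
func load(path string) (ivDB, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	db := ivDB{}
	err = gob.NewDecoder(f).Decode(&db)
	return db, err
}

// roundTrip saves db into a temporary directory and loads it again.
func roundTrip(db ivDB) (ivDB, error) {
	dir, err := os.MkdirTemp("", "ivdb")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)
	p := filepath.Join(dir, "iv.gob")
	if err := save(p, db); err != nil {
		return nil, err
	}
	return load(p)
}

func main() {
	db := ivDB{"vm.img#0": {IV: [16]byte{1}, MAC: [16]byte{2}}}
	got, err := roundTrip(db)
	fmt.Println(err, got["vm.img#0"] == db["vm.img#0"])
}
```

Calling `save` on exit, and possibly from a timer, keeps the design free of a database dependency at the cost of writing the full file each time.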

Do you mean that I should first try to use the old IV, and then generate a new one if the authentication code changes?

Yes exactly. The upside is that you don't have to do any extra work if the block has not changed.

If the block has changed, you do "AES-SIV, AES-SIV" instead of "SHA256, AES-SIV".
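The "try the old IV first" flow can be sketched against Go's `cipher.AEAD` interface. This is an illustration under assumptions, not the prototype's code: AES-GCM from the standard library stands in for AES-SIV, `blockState` and `encryptBlock` are hypothetical names, and with GCM the tag sits at the end of the output rather than at the front.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"crypto/subtle"
	"fmt"
)

const tagLen = 16

// blockState is what is remembered per block: the IV and the MAC it produced.
type blockState struct {
	iv  []byte
	tag []byte
}

// tagOf returns the authentication tag. Go's GCM appends it to the ciphertext.
func tagOf(ct []byte) []byte { return ct[len(ct)-tagLen:] }

// encryptBlock first encrypts with the stored IV. If the resulting MAC matches
// the stored one, the block is unchanged and that ciphertext is returned.
// Otherwise the trial output is discarded and the block is encrypted a second
// time under a fresh random IV.
func encryptBlock(aead cipher.AEAD, prev *blockState, plaintext []byte) ([]byte, *blockState) {
	if prev != nil {
		ct := aead.Seal(nil, prev.iv, plaintext, nil)
		if subtle.ConstantTimeCompare(tagOf(ct), prev.tag) == 1 {
			return ct, prev // unchanged: one encryption, no extra hash
		}
	}
	iv := make([]byte, aead.NonceSize())
	if _, err := rand.Read(iv); err != nil {
		panic(err)
	}
	ct := aead.Seal(nil, iv, plaintext, nil)
	return ct, &blockState{iv: iv, tag: append([]byte(nil), tagOf(ct)...)}
}

// newAEAD builds the stand-in cipher with a random demo key.
func newAEAD() cipher.AEAD {
	key := make([]byte, 32)
	if _, err := rand.Read(key); err != nil {
		panic(err)
	}
	block, err := aes.NewCipher(key)
	if err != nil {
		panic(err)
	}
	aead, err := cipher.NewGCM(block)
	if err != nil {
		panic(err)
	}
	return aead
}

func main() {
	aead := newAEAD()
	_, st1 := encryptBlock(aead, nil, []byte("block v1"))
	_, st2 := encryptBlock(aead, st1, []byte("block v1"))
	_, st3 := encryptBlock(aead, st1, []byte("block v2"))
	fmt.Println("unchanged block reuses IV:", st2 == st1)
	fmt.Println("changed block gets new IV:", string(st3.iv) != string(st1.iv))
}
```

No checksum of the plaintext has to be stored in this variant, since the MAC that is part of the ciphertext anyway serves as the change detector.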

rfjakob avatar Nov 18 '17 19:11 rfjakob

If I read https://tools.ietf.org/html/rfc5297#section-2.6 correctly, the MAC is in the first 16 bytes of the ciphertext in AES-SIV.

That's correct, but gocryptfs also adds the IV, so the MAC is in the range 16:32. I've updated my repository with the proposed changes and it seems to work well. It still has to be considered highly experimental though. ;)
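The block layout described here (IV in bytes 0:16, MAC in bytes 16:32, encrypted payload after that) can be written down as a small slicing helper. `splitBlock` is a hypothetical name, and the layout is taken from this comment rather than checked against the gocryptfs source.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	ivLen  = 16
	macLen = 16
)

// splitBlock slices an on-disk reverse-mode block into IV, MAC, and payload,
// assuming the layout IV | MAC | payload described in the thread.
func splitBlock(block []byte) (iv, mac, payload []byte, err error) {
	if len(block) < ivLen+macLen {
		return nil, nil, nil, errors.New("block too short")
	}
	return block[:ivLen], block[ivLen : ivLen+macLen], block[ivLen+macLen:], nil
}

func main() {
	block := make([]byte, 40)
	for i := range block {
		block[i] = byte(i)
	}
	iv, mac, payload, err := splitBlock(block)
	fmt.Println(err, iv[0], mac[0], len(payload))
}
```

Comparing only the `mac` slice of the old and the new block is enough to decide whether the stored IV can be kept.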

slackner avatar Nov 21 '17 20:11 slackner

As you might have already noticed (see the commit above), in the meantime I have switched to an extended attribute based approach. There are no longer completely random block IVs, but it still solves the most important issues. The advantages and disadvantages are discussed in the commit message. What is your opinion on this approach, is there any chance to get something like this upstream?

https://github.com/slackner/gocryptfs/commit/39cdd4cb775bbc9f0561bbfd1896a4b420cb4471

Best regards, Sebastian

slackner avatar Dec 16 '18 20:12 slackner

I have some free time after new year's eve, will check!

rfjakob avatar Dec 24 '18 05:12 rfjakob

  1. Renaming or moving a file causes a full retransmission due to the changed initialization vectors.

It seems you have solved my problem with online backups in reverse mode (#402), where moving a big file from one directory to another would trigger a slow reupload of that file.

My simpler and less elegant approach to solving that problem would have been to remove the "directory" part from the IV and then store a PGP-signed hashdeep file alongside the backup. I can do that since I only back up snapshots of the files every few hours and I don't really care how fast hashdeep is, and the files don't change while hashdeep is running since it's a snapshot. Because I would check the signed hashes before restoring any files, gocryptfs does not need to detect any modification; it is just responsible for encryption (confidentiality).

But your approach of storing a new IV for each file separately seems interesting.

kwinz avatar Aug 23 '19 16:08 kwinz

But here's a thought: What's the actual benefit of per file IVs anyway?

  1. That having the same file twice would create two different encrypted counterparts?

That the same files map to the same encrypted binaries, is a desirable property, because then they can be de-duplicated by the backup daemon. So if I have the same file e.g. 5 times they are only uploaded once, even if the backup daemon can only see the encrypted files.

  2. "an attacker might be able to collect multiple plaintext <-> ciphertext samples, all encrypted with the same key and IV."

Is that a problem assuming that AES-SIV is not broken?

I think what I actually need for reverse mode backups is a per-volume IV, not a per-file IV, so that the backup daemon does not know if I have the same file as another customer. But I have that property anyway because the content key is derived from the master key and the master key is derived from the password and gocryptfs.conf's global IV.

kwinz avatar Aug 23 '19 17:08 kwinz

For those interested in an encrypted, deduping file system, you might be interested in Tahoe-LAFS. It might be a fit for some backup scenarios, since deduping is an intrinsic property of the file system. I don't mean it's a replacement for gocryptfs, since they're completely different beasts. Just sharing a piece of software which may be useful, or at least interesting, for the audience of gocryptfs. My 2c.

pataquets avatar May 28 '20 00:05 pataquets