pycryptodome
Proper way to do AES-GCM encryption of big files (larger than RAM) in blocks
When encrypting big files with AES-GCM, potentially 10 GB or more, we need to process them in blocks (let's say 16 MB) for memory (RAM) reasons, rather than doing encrypt(plaintext) in one pass.
Is the following approach ok?
nonce = Random.new().read(16)
out.write(nonce)
cipher = AES.new(key, AES.MODE_GCM, nonce=nonce)
while True:
    block = f.read(16*1024*1024)
    if not block:  # EOF
        break
    out.write(cipher.encrypt(block))  # we encrypt multiple blocks with the same
                                      # "cipher" object, especially the same nonce
out.write(cipher.digest())  # we compute the auth. tag only once at the end
Here we encrypt multiple 16MB blocks with the same "cipher" object (same nonce).
I read some criticisms about this approach in the article AEADs: getting better at symmetric cryptography, paragraph "AEADs with large plaintexts".
On the other hand, it really behaves like a stream cipher, so everything looks ok:
print(cipher.encrypt(b'hello')) # 4cadd813be in hexadecimal
print(cipher.encrypt(b'hello')) # d3585e3471, different, fortunately!
TL;DR What is the correct way to do big files encryption with pycryptodome + AES-GCM?
@josephernest
I believe that a stream cipher would provide the solution you are looking for. You can pick one from the pycryptodome documentation of stream ciphers.
E.g. https://pycryptodome.readthedocs.io/en/latest/src/cipher/chacha20_poly1305.html for cryptography and authentication.
@texadactyl Thank you for your answer. It seems it works also with AES-GCM: for example chunking by 4 bytes has no impact on the encrypted result.
import Crypto.Random, Crypto.Cipher.AES
key = bytes.fromhex('7d29ccf69c671775e17d4b9dd6485fd8')
nonce = bytes.fromhex('04972c7927042af0ee10c7e6ac56ddd3')
# usual method (whole plaintext in one pass)
cipher = Crypto.Cipher.AES.new(key, Crypto.Cipher.AES.MODE_GCM, nonce=nonce)
print(cipher.encrypt(b'goodgoodcrypto').hex()) # e7e4d3b74617d78022376651ba3a
# with chunks
cipher2 = Crypto.Cipher.AES.new(key, Crypto.Cipher.AES.MODE_GCM, nonce=nonce)
print(cipher2.encrypt(b'good').hex()) # e7e4d3b7
print(cipher2.encrypt(b'good').hex()) # 4617d780
print(cipher2.encrypt(b'cryp').hex()) # 22376651
print(cipher2.encrypt(b'to').hex()) # ba3a
# gives exactly the same result! i.e. e7e4d3b74617d78022376651ba3a
Shouldn't the output of AES in GCM mode be longer than the input by 16 bytes? You need the authentication tag for it to be GCM...
@tomato42 To generate the authentication tag, one has to use the encrypt_and_digest or digest method of the cipher object:
print(cipher.digest().hex())  # auth tag: d7552b8b7c8e96bd1cc942d900c90cbc
You will get the same result on both of my examples.
You can also use encrypt_and_digest(...).
Your approach suffers from the fact that you are only checking the digest at the very end of decryption. So on the decrypting side (the counterpart of out.write(cipher.encrypt(block))) you are using unauthenticated data, and you only find out whether it is authentic at the very end.
If you look at any generic AEAD description it will tell you that you must not output any data until it has all been authenticated.
It may not be a big deal in your case, but that's what the article is complaining about. I recommend looking into Rogaway et al.'s Online Authenticated Encryption^1, especially the STREAM API in section 7. It does basically what the article suggests: split the file into a number of variable-sized chunks and apply the AEAD to each of them. Rogaway's approach uses the nonce as a counter, which prevents an attacker from reordering chunks without causing an authentication failure.
An easier approach would be to run your AEAD cipher twice: once just to authenticate your data, and then again to decrypt. This would work if you are not worried about the file changing between authentication and decryption.
Also, to be clear: AES-GCM is effectively a stream cipher. It encrypts with AES in counter mode and authenticates the ciphertext with GHASH.