Make age parallel

Open paulmillr opened this issue 4 years ago • 6 comments

If you encrypt files on a machine with tons of RAM and cores, age isn't any faster than it is on a basic, slow PC.

I think it would be great to utilize resources when they're available.

Tried this on Linux via piping and via -i -o; in both cases I'm seeing only a tiny load on one core.

paulmillr avatar Mar 09 '20 02:03 paulmillr

The problem is that it isn't feasible to do that without overhauling Go's cryptography libraries (and it might be unsafe; I don't know enough about goroutine security to say for sure).

The only functions in age that actually handle the plaintext are EncryptOAEP/DecryptOAEP from crypto/rsa and Seal/Open from x/crypto/chacha20poly1305, neither of which is parallelized. Both could be, but RSA generally hasn't been because it needs a parallel-friendly modular exponentiation function. ChaCha is fairly easy to parallelize, but Go's implementation is handwritten assembly using vector instructions when available (unless you're using a purego build, gccgo, or an uncommon CPU architecture). I have a feeling that it probably outperforms a goroutine version, but maybe not.

RKinsey avatar Mar 18 '20 14:03 RKinsey

@RKinsey I'm not sure if this argument actually holds. internal/stream/stream.go seems to read and write in chunks of 64 KiB (plus 16 bytes of Poly1305 tag for each encrypted chunk). Therefore, there's parallelization potential there by queueing up the encryption/decryption of chunks (or multiples of chunks) across cores. Orchestrating the whole thing so that there's no bottleneck when reading or writing is another story, though.
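
To make the chunk layout concrete, here is a rough sketch of the arithmetic it implies. The function names are hypothetical (they are not taken from internal/stream/stream.go), and the handling of an empty payload as a single empty chunk is an assumption.

```go
package main

import "fmt"

const (
	chunkSize = 64 * 1024 // plaintext bytes per chunk, as described above
	tagSize   = 16        // Poly1305 tag appended to each encrypted chunk
)

// chunkCount returns how many chunks a payload of n plaintext bytes occupies.
// Assumption: an empty payload is still written as one (empty) chunk.
func chunkCount(n int64) int64 {
	if n == 0 {
		return 1
	}
	return (n + chunkSize - 1) / chunkSize
}

// ciphertextSize returns the encrypted payload size, excluding any file header.
func ciphertextSize(n int64) int64 {
	return n + chunkCount(n)*tagSize
}

// chunkOffset returns the byte offset of encrypted chunk i within the payload.
// Every chunk before the last is full, so the offset is a simple multiple.
func chunkOffset(i int64) int64 {
	return i * (chunkSize + tagSize)
}

func main() {
	n := int64(1 << 20) // a 1 MiB payload
	fmt.Println(chunkCount(n), ciphertextSize(n), chunkOffset(3))
	// prints: 16 1048832 196656
}
```

Because every offset is computable up front, a worker can be handed chunk i without waiting for chunks 0..i-1 to be processed.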

xorhash avatar Apr 03 '20 13:04 xorhash

Yeah, @RKinsey was talking about the key-wrapping phase. The actual symmetric stream encryption is where the bulk of age's work happens (at least for larger files), and it looks like it could be parallelized.

The stream is divided into fixed-size chunks of 64 KiB, and each chunk uses the same encryption key but, of course, a different nonce. The nonce is calculated from the chunk number. It's a seekable stream and thus theoretically easy to parallelize. In practice, though, the code would be more complex than it currently is, so it'd need a pretty good test suite.

joonas-fi avatar Oct 29 '20 11:10 joonas-fi

Just running chacha20-poly1305 in parallel for a few blocks easily more than doubles the speed. My own tool is written in Python and does 2.2 GB/s encryption and decryption (using 4 threads for chacha, otherwise single-threaded). It is a shame that the crypto libraries don't offer threaded implementations of these algorithms.

This is on a machine where age does 1 GB/s and rage only 400 MB/s.

Tronic avatar Oct 31 '21 21:10 Tronic

@Tronic are you using the latest rage? The speed difference should be minimal right now.

paulmillr avatar Oct 31 '21 23:10 paulmillr

@paulmillr rage 0.7.0 on Windows.

Tronic avatar Nov 01 '21 00:11 Tronic