slh-dsa: missing midstate caching
Was there an intentional choice to omit SHA2 midstate caching from the SLH-DSA crate or is this still a "to-do" for later?
The SLH-DSA spec is written intentionally to encourage this, by padding $\text{PK.seed}$ to match the SHA2 block size in calls to $T_{\ell}$ and its subclass functions $H$ and $F$.
Also the HMAC here can be cached.
In case you are thinking "the perf improvement isn't worth it", you would be wrong. In my SLH-DSA implementation (closed source for now), I benchmarked signing with and without midstate/HMAC caching.
Without midstate caching
| Params | keygen | signing | verifying |
|---|---|---|---|
SLH-DSA-SHA2-128f |
3.2 ms | 108 ms | 6.8 ms |
SLH-DSA-SHA2-128s |
314 ms | 2345 ms | 2.4 ms |
With midstate caching
| Params | keygen | signing | verifying |
|---|---|---|---|
SLH-DSA-SHA2-128f |
1.5 ms | 45.1 ms | 3.2 ms |
SLH-DSA-SHA2-128s |
143 ms | 1102 ms | 1.1 ms |
On average, this midstate caching improves speed by a factor of about 2.2x on my machine, at least for 128-bit parameter sets. I haven't implemented higher security parameter sets yet.
For clarity, the above metrics are using my own implementation. The RustCrypto slh-dsa crate's performance is about on-par with mine when midstate-caching is disabled.
RustCrypto slh-dsa (no midstate caching)
| Params | keygen | signing | verifying |
|---|---|---|---|
SLH-DSA-SHA2-128f |
4.3 ms | 101 ms | 6.0 ms |
SLH-DSA-SHA2-128s |
283 ms | 2080 ms | 2.4 ms |
Hmm. Turns out the performance disparity in my code was unrelated to midstate caching. The source of the speedup between my lib and the slh-dsa crate must actually be coming from somewhere else. I'll post here again once I figure it out.
cc @bifurcation
I don't think it was deliberate. Sounds like a missing optimization.
I think you have me mixed up with someone else. I only worked on ML-DSA, not SLH-DSA.
Aah sorry! Should've cc'd @tjade273
i'm not sure what drugs I was taking yesterday, but the benchmarks i'm running today paint a clear picture.
Here's rust crypto's slh-dsa crate, benchmarked with the Sha2_128s parameter set.
test rustcrypto_compare::benchmark_rustcrypto_slh_dsa_keygen 217,295,070.00 ns/iter (+/- 39,402,576.65)
test rustcrypto_compare::benchmark_rustcrypto_slh_dsa_sign_internal 1,722,601,885.10 ns/iter (+/- 616,310,319.28)
test rustcrypto_compare::benchmark_rustcrypto_slh_dsa_verify_internal 1,496,752.00 ns/iter (+/- 55,788.19)
Here's the code i used if you're curious
type RustCryptoParams = slh_dsa::Sha2_128s;
#[bench]
fn benchmark_rustcrypto_slh_dsa_keygen(b: &mut test::Bencher) {
let mut rng = rand::thread_rng();
b.iter(|| slh_dsa::SigningKey::<RustCryptoParams>::new(&mut rng));
}
#[bench]
fn benchmark_rustcrypto_slh_dsa_sign_internal(b: &mut test::Bencher) {
let mut rng = rand::thread_rng();
let keypair = slh_dsa::SigningKey::<RustCryptoParams>::new(&mut rng);
let msg = *b"hello world";
b.iter(|| keypair.slh_sign_internal(&msg, None));
}
#[bench]
fn benchmark_rustcrypto_slh_dsa_verify_internal(b: &mut test::Bencher) {
let mut rng = rand::thread_rng();
let keypair = slh_dsa::SigningKey::<RustCryptoParams>::new(&mut rng);
let msg = *b"hello world";
let signature = keypair.slh_sign_internal(&msg, None);
let pubkey: &slh_dsa::VerifyingKey<_> = keypair.as_ref();
b.iter(|| pubkey.slh_verify_internal(&msg, &signature));
}
Here's my crate without midstate caching, with the same SLH-DSA-SHA2-128s parameter set. It's a little faster than slh-dsa but not massively so. The without_cached_root_tree benchmark is running the exact same algorithms from FIPS205 that your crate is running: Deterministically derive all secrets at signing time, no caching whatsoever. The with_cached_root_tree benchmark does the same but with the top-level XMSS tree fully cached in memory (saves $1/d$ time per signature, as it would otherwise be recomputed every time).
test benchmark_slh_dsa_keygen 200,688,515.00 ns/iter (+/- 26,998,147.40)
test benchmark_slh_dsa_sign_internal_with_cached_root_tree 1,332,255,061.40 ns/iter (+/- 48,552,050.67)
test benchmark_slh_dsa_sign_internal_without_cached_root_tree 1,606,177,744.30 ns/iter (+/- 277,946,970.08)
test benchmark_slh_dsa_verify_internal 1,555,584.80 ns/iter (+/- 297,994.59)
Here's my crate with midstate caching. It's enormously faster now.
test benchmark_slh_dsa_keygen 132,567,475.30 ns/iter (+/- 61,342,421.73)
test benchmark_slh_dsa_sign_internal_with_cached_root_tree 740,373,699.80 ns/iter (+/- 142,053,816.77)
test benchmark_slh_dsa_sign_internal_without_cached_root_tree 858,701,966.90 ns/iter (+/- 187,697,492.47)
test benchmark_slh_dsa_verify_internal 816,321.60 ns/iter (+/- 35,688.97)
@tarcieri rather than beating around the bush i invited you and @tjade273 to my repo as collaborators so you can inspect my benchmarks.
I agree this is an optimization we should do; I can't guarantee it will happen immediately, but I will put it on my TODO list unless someone else claims it first