signatures icon indicating copy to clipboard operation
signatures copied to clipboard

slh-dsa: missing midstate caching

Open conduition opened this issue 6 months ago • 8 comments

Was there an intentional choice to omit SHA2 midstate caching from the SLH-DSA crate or is this still a "to-do" for later?

The SLH-DSA spec is written intentionally to encourage this, by padding $\text{PK.seed}$ to match the SHA2 block size in calls to $T_{\ell}$ and its subclass functions $H$ and $F$.

Also the HMAC here can be cached.

In case you are thinking "the perf improvement isn't worth it", you would be wrong. In my SLH-DSA implementation (closed source for now), I benchmarked signing with and without midstate/HMAC caching.

Without midstate caching

Params keygen signing verifying
SLH-DSA-SHA2-128f 3.2 ms 108 ms 6.8 ms
SLH-DSA-SHA2-128s 314 ms 2345 ms 2.4 ms

With midstate caching

Params keygen signing verifying
SLH-DSA-SHA2-128f 1.5 ms 45.1 ms 3.2 ms
SLH-DSA-SHA2-128s 143 ms 1102 ms 1.1 ms

On average, this midstate caching improves speed by a factor of about 2.2x on my machine, at least for 128-bit parameter sets. I haven't implemented higher security parameter sets yet.

conduition avatar Aug 07 '25 00:08 conduition

For clarity, the above metrics are using my own implementation. The RustCrypto slh-dsa crate's performance is about on-par with mine when midstate-caching is disabled.

RustCrypto slh-dsa (no midstate caching)

Params keygen signing verifying
SLH-DSA-SHA2-128f 4.3 ms 101 ms 6.0 ms
SLH-DSA-SHA2-128s 283 ms 2080 ms 2.4 ms

conduition avatar Aug 07 '25 01:08 conduition

Hmm. Turns out the performance disparity in my code was unrelated to midstate caching. The source of the speedup between my lib and the slh-dsa crate must actually be coming from somewhere else. I'll post here again once I figure it out.

conduition avatar Aug 07 '25 22:08 conduition

cc @bifurcation

I don't think it was deliberate. Sounds like a missing optimization.

tarcieri avatar Aug 07 '25 23:08 tarcieri

I think you have me mixed up with someone else. I only worked on ML-DSA, not SLH-DSA.

bifurcation avatar Aug 08 '25 01:08 bifurcation

Aah sorry! Should've cc'd @tjade273

tarcieri avatar Aug 08 '25 15:08 tarcieri

i'm not sure what drugs I was taking yesterday, but the benchmarks i'm running today paint a clear picture.

Here's rust crypto's slh-dsa crate, benchmarked with the Sha2_128s parameter set.

test rustcrypto_compare::benchmark_rustcrypto_slh_dsa_keygen            217,295,070.00 ns/iter (+/- 39,402,576.65)
test rustcrypto_compare::benchmark_rustcrypto_slh_dsa_sign_internal   1,722,601,885.10 ns/iter (+/- 616,310,319.28)
test rustcrypto_compare::benchmark_rustcrypto_slh_dsa_verify_internal     1,496,752.00 ns/iter (+/- 55,788.19)

Here's the code i used if you're curious

type RustCryptoParams = slh_dsa::Sha2_128s;

#[bench]
fn benchmark_rustcrypto_slh_dsa_keygen(b: &mut test::Bencher) {
    let mut rng = rand::thread_rng();
    b.iter(|| slh_dsa::SigningKey::<RustCryptoParams>::new(&mut rng));
}

#[bench]
fn benchmark_rustcrypto_slh_dsa_sign_internal(b: &mut test::Bencher) {
    let mut rng = rand::thread_rng();
    let keypair = slh_dsa::SigningKey::<RustCryptoParams>::new(&mut rng);
    let msg = *b"hello world";
    b.iter(|| keypair.slh_sign_internal(&msg, None));
}

#[bench]
fn benchmark_rustcrypto_slh_dsa_verify_internal(b: &mut test::Bencher) {
    let mut rng = rand::thread_rng();
    let keypair = slh_dsa::SigningKey::<RustCryptoParams>::new(&mut rng);
    let msg = *b"hello world";
    let signature = keypair.slh_sign_internal(&msg, None);
    let pubkey: &slh_dsa::VerifyingKey<_> = keypair.as_ref();
    b.iter(|| pubkey.slh_verify_internal(&msg, &signature));
}

Here's my crate without midstate caching, with the same SLH-DSA-SHA2-128s parameter set. It's a little faster than slh-dsa but not massively so. The without_cached_root_tree benchmark is running the exact same algorithms from FIPS205 that your crate is running: Deterministically derive all secrets at signing time, no caching whatsoever. The with_cached_root_tree benchmark does the same but with the top-level XMSS tree fully cached in memory (saves $1/d$ time per signature, as it would otherwise be recomputed every time).

test benchmark_slh_dsa_keygen                                           200,688,515.00 ns/iter (+/- 26,998,147.40)
test benchmark_slh_dsa_sign_internal_with_cached_root_tree            1,332,255,061.40 ns/iter (+/- 48,552,050.67)
test benchmark_slh_dsa_sign_internal_without_cached_root_tree         1,606,177,744.30 ns/iter (+/- 277,946,970.08)
test benchmark_slh_dsa_verify_internal                                    1,555,584.80 ns/iter (+/- 297,994.59)

Here's my crate with midstate caching. It's enormously faster now.

test benchmark_slh_dsa_keygen                                           132,567,475.30 ns/iter (+/- 61,342,421.73)
test benchmark_slh_dsa_sign_internal_with_cached_root_tree              740,373,699.80 ns/iter (+/- 142,053,816.77)
test benchmark_slh_dsa_sign_internal_without_cached_root_tree           858,701,966.90 ns/iter (+/- 187,697,492.47)
test benchmark_slh_dsa_verify_internal                                      816,321.60 ns/iter (+/- 35,688.97)

conduition avatar Aug 09 '25 01:08 conduition

@tarcieri rather than beating around the bush i invited you and @tjade273 to my repo as collaborators so you can inspect my benchmarks.

conduition avatar Aug 09 '25 02:08 conduition

I agree this is an optimization we should do; I can't guarantee it will happen immediately, but I will put it on my TODO list unless someone else claims it first

tjade273 avatar Aug 09 '25 02:08 tjade273