blake2_simd icon indicating copy to clipboard operation
blake2_simd copied to clipboard

make compress_block function public

Open ordian opened this issue 6 years ago • 2 comments

There are some needs to use the compression function from RFC 7693. Would you be willing to expose it in your awesome crate or extract it in a lower-level crate that other libs can use?

ordian avatar Sep 03 '19 21:09 ordian

Very interesting. Yes, we could totally expose it. The current private compression APIs include some complexity that might not be relevant for your use case:

  • 2x/4x (or in the case of BLAKE2s, 4x/8x) parallel compression
  • an internal loop that compresses many blocks in a single call, to avoid paying the cost of vector transposition for every block
  • a stride parameter, to allow BLAKE2bp/BLAKE2sp to reuse the BLAKE2b/BLAKE2s code

It sounds you're only looking to expose 1x compression, in which case the looping optimization isn't necessary, because there's nothing to transpose. And you don't need the stride parameter. You presumably still want to do runtime feature detection (whether to use the portable.rs implementation or the avx2.rs implementation). Currently we assume you've done that once at a higher level and passed the result down as a parameter, which requires a different private API, but it's possible we could move that detail inside the compression API without a measurable loss in performance. (I think is_x86_feature_detected! boils down to a single Ordering::Relaxed load, but I haven't looked at the assembly output.) So we might be able to expose a simpler API than what's currently there.

One issue I notice in the spec you linked, is that the number of rounds is a free parameter. That clashes with this implementation in this crate, which hardcodes all the rounds so that they all get inlined. There are a number of different SIMD tricks that get deployed to load message words efficiently, which differ from round to round (here's the code). It's possible we could split all that out into a big switch statement, but I also worry that branch prediction isn't going to work well there, and that performance would suffer. That's the sort of thing I'd want to test very carefully.

A side note: API described in the spec you linked is a little less flexible than it could be, because while it takes a variable number of rounds, it always assumes it's starting from round 1. A more flexible API might be able to express something like "just do one round of compression, but make it round number 3."

Another side note: If you're exposing the BLAKE2b compression function, you might also want to consider exposing BLAKE2s. The relative performance of BLAKE2b and BLAKE2s on 64-bit CPUs is somewhat complicated once SIMD enters the picture, and BLAKE2b isn't always faster.

oconnor663 avatar Sep 03 '19 23:09 oconnor663

Thank you for the quick and very insightful reply!

Currently we assume you've done that once at a higher level and passed the result down as a parameter, which requires a different private API, but it's possible we could move that detail inside the compression API without a measurable loss in performance. (I think is_x86_feature_detected! boils down to a single Ordering::Relaxed load, but I haven't looked at the assembly output.) So we might be able to expose a simpler API than what's currently there.

There is ifunc trick to do cpu detection only on the first call.

One issue I notice in the spec you linked, is that the number of rounds is a free parameter. That clashes with this implementation in this crate, which hardcodes all the rounds so that they all get inlined. There are a number of different SIMD tricks that get deployed to load message words efficiently, which differ from round to round (here's the code). It's possible we could split all that out into a big switch statement, but I also worry that branch prediction isn't going to work well there, and that performance would suffer. That's the sort of thing I'd want to test very carefully.

I see, that's a valid concern and I agree the impact needs to be measured.

A side note: API described in the spec you linked is a little less flexible than it could be, because while it takes a variable number of rounds, it always assumes it's starting from round 1. A more flexible API might be able to express something like "just do one round of compression, but make it round number 3."

Another side note: If you're exposing the BLAKE2b compression function, you might also want to consider exposing BLAKE2s. The relative performance of BLAKE2b and BLAKE2s on 64-bit CPUs is somewhat complicated once SIMD enters the picture, and BLAKE2b isn't always faster.

I think the current spec tries to expose the bare minimum and can be extended later.

ordian avatar Sep 04 '19 09:09 ordian