hashes icon indicating copy to clipboard operation
hashes copied to clipboard

blake2: integrate `blake2b_simd`/`blake2s_simd` crates

Open tarcieri opened this issue 3 years ago • 32 comments

Integrates original sources from these crates, which provide AVX2-accelerated SIMD backends:

https://github.com/oconnor663/blake2_simd

Taken from this commit:

  • Hash: 7bf791e67245bb84132d1ee0e6a893bb8c85c093
  • Author: Jack O'Connor [email protected]
  • Date: Fri Nov 13 15:50:16 2020 -0500
  • Title: AES-CTR benchmarks

Next steps

  • [x] Determine if we want to proceed with a combined blake2 crate or if we can get the blake2b crate and use blake2b/blake2s
  • [x] Define Blake2b/Blake2s structs which provide impls of the digest traits. These can also be used to replace the blake2::blake2b::blake2b/blake2::blake2s::blake2s static functions e.g. with Blake2b::digest(...)
  • [x] Also define VarBlake2b/VarBlake2s(?)
  • [x] Address compile errors/panics on no_std targets
  • [x] Figure out solution for handling KATs
  • [x] ~~Address clippy errors (presently TODOs) in the PR~~ (caused by arrayref, can circle back on that)

tarcieri avatar Feb 04 '21 15:02 tarcieri

As discussed in #88, this PR merges the blake2b_simd and blake2s_simd crates into the blake2 crate, placing each algorithm under blake2b and blake2s modules respectively, and promoting the blake2bp and blake2sp modules to the toplevel of the crate.

Alternatively it might be possible to retain the two-crate structure, however unfortunately while we own the blake2s crate, we don't own the blake2b crate. I have contacted its current owner (as presently all releases are yanked so it's unused) to see if we can obtain it, in which case we can do a more straightforward import of blake2b_simd => blake2b and blake2s_simd => blake2s.

~~This PR is presently about 170 klocs(!) owing to the vendored C implementations in blake2/benches/. We should probably decide if we want to retain all that code or figure out a different solution. It'd be nice if there were *-sys crates of the C code sourced from elsewhere which we could use for comparison instead.~~ (Edit: removed the benches for now. We can follow up on them separately)

tarcieri avatar Feb 04 '21 15:02 tarcieri

Also note: I haven't tried to impl any of the digest traits. I'd like to sort out whether we'd like to proceed with a combined single blake2 crate with submodules for blake2b and blake2s or if we can get the blake2b crate first.

@newpavlov any thoughts on this?

tarcieri avatar Feb 04 '21 15:02 tarcieri

Oof, the KATs are enormous:

$ ls -lh tests/blake2-kat.json
-rw-r--r--  1 bascule  staff   1.8M Feb  4 05:15 tests/blake2-kat.json

It seems like https://github.com/oconnor663/blake2_simd was able to keep them conveniently out-of-band by virtue of having a workspace and keeping it outside the crates themselves. ~~We should definitely do something similar to keep the crate size small.~~

Edit: moved them to .kat/blake2-kat.json in the toplevel of the repo. This unfortunately means that these tests will fail if anyone tries to test the packaged crate. Perhaps we could have e.g. a special cfg directive (e.g. --cfg kat) to enable these tests so they're ignored when testing from a released crate?

tarcieri avatar Feb 04 '21 16:02 tarcieri

The good news: it compiles and all of the original tests are now passing! 🎉

I'll update the toplevel description with some next steps.

tarcieri avatar Feb 04 '21 16:02 tarcieri

@oconnor663 @newpavlov so it turns out I was able to obtain ownership of the blake2b crate after all.

So the question is: should we proceed with a combined blake2 crate as this PR currently implements, or deprecate the blake2 crate and split it into blake2b and blake2s like blake2_simd did?

tarcieri avatar Feb 05 '21 14:02 tarcieri

Assuming that amount of code reuse between b and s modules is minimal, I am fine with the crate split. But it could be worth to postpone it until digest v0.10 release.

Oof, the KATs are enormous:

This is why we have blobby.

newpavlov avatar Feb 06 '21 08:02 newpavlov

Assuming that amount of code reuse between b and s modules is minimal...

When I was making this PR I did notice considerable overlap/duplication between the two crates. There are large swaths of the code that are completely identical. For example, the file many.rs is nearly identical, with some small changes to constants (e.g. 2/4/8), the blake2b vs blake2s names, and some small differences in comments.

I think there's quite a bit of opportunity to extract a common module shared between them which eliminates such duplication.

This is why we have blobby.

I did a quick ballparking exercise, ignoring all framing and just decoding and concatenating the hex strings of the 1.8MB blake2-kat.json file.

This still results in 827kB of binary data, which I would consider enormous and probably a bad idea to include in a crate that would otherwise be ~60kB otherwise.

tarcieri avatar Feb 06 '21 14:02 tarcieri

But it could be worth to postpone it until digest v0.10 release.

I definitely think we should land #217 first and do another release of the current blake2 crate before moving forward with this.

tarcieri avatar Feb 06 '21 15:02 tarcieri

Yes there is quite a lot of duplicated code between 2b and 2s. I experimented a couple times with refactoring it to share more, but I wound up getting frustrated with having everything in a giant macro, and I gave up and decided copy-pasting was a better result (even as a maintainer, but especially for anyone who has to read the code). Some scattered thoughts:

  • There are some quirky differences in the way the parameter blocks are laid out between 2b and 2s, particularly the "node offset", which needs different code and not just different constants. This sort of thing is annoying to model in a shared macro.
  • I don't think it's possible to share SIMD code between 2b and 2s. Of course they use different intrinsics, though maybe that could be abstracted out. But more fundamentally, the word loading in the single-block compression functions is all optimized by hand (originally by Samuel upstream), and the two versions don't look similar.
  • In contrast, most of the code that might be easier to share is the relatively straightforward parts like buffer management and the public parameters API. I'm not thrilled to maintain two copies of this code, but so far I've preferred to do that rather than make the SIMD code any harder to reason about.
  • The many.rs code is in between. It's duplicated, but it's definitely tricky code that's not easy to reason about. That might be the highest-value opportunity for finding a way to share it. But on the other hand, again, the fact that it's so tricky makes me even less excited about making it a big macro.

Maybe there are some code sharing strategies besides "giant macro" that could work here, that I haven't thought about? Maybe some interesting possibilities open up when we can use const generics?

oconnor663 avatar Feb 08 '21 17:02 oconnor663

Another scattered thought: The many APIs were super valuable to me back when I was working on BLAKE3 prototypes. Now that BLAKE3 is its own thing and doesn't depend on this code, it might be that the only existing callers of the many APIs are the BLAKE2bp and BLAKE2sp implementations also in these crates. Having a totally generic "hash many" capability is neat, but it might be that in practice no one is ever going to use it. It's possible that we could simplify the API, if it was just a private implementation detail of BLAKE2bp and BLAKE2sp. Something to think about, if many ends up getting in the way of code sharing.

oconnor663 avatar Feb 08 '21 17:02 oconnor663

@oconnor663 Thank you for your comments!

In digest v0.10 we have buffer management included behind a feature (see the core_api module). So implementation crates only have to define state initialization, blocks compression, and finalization parts.

One possible alternative to giant macros is to be generic over a private trait with necessary associated constants and types. Since Rust monomorphizes generics by default, it should be identical to the macro approach performance-wise. I have successfully refactored some macros using this approach (though const generics would have been a better fit in some cases), so it may apply here as well.

newpavlov avatar Feb 09 '21 10:02 newpavlov

I'm weakly in favor of merging the crates for the following reasons:

  1. It will provide the simplest transition path for existing users of the blake2 crate
  2. It leaves the door open for DRYing out some of the duplicated code

I think the core implementation of update_many could be converted into a sealed trait as @newpavlov suggested, with the constant value differences hoisted into associated consts. That could probably be made easier/more futureproof by making the many modules private and re-exporting the functions that comprise the public APIs from the blake2b/blake2s modules.

tarcieri avatar Feb 09 '21 14:02 tarcieri

As of fe77461, this PR implements what is effectively the sum of the blake2 and blake2b_simd+blake2s_simd APIs. I'm not removing WIP/draft yet because I want to add the full set of tests from the blake2 crate first, as the original blake2-like API only has basic smoke tests via rustdoc.

That said, it now impls blake2 9.x-compatible structs which impl traits from the crypto-mac and digest crates, in addition to providing the entire original API of blake2b_simd and blake2s_simd.

There are a few types it might be worth initially unifying, like replacing blake2b::Hash/blake2s::Hash with digest::Output and thereby dropping the arrayvec dependency, but other than that I think it makes sense to at least initially retain the full set of APIs to smooth a transition. As things stand the current implementation is largely a drop-in replacement for all three crates, which I think is pretty neat.

tarcieri avatar Aug 29 '21 03:08 tarcieri

Took a look at the clippy::ptr_offset_with_cast warningss... they all originate in arrayref, which is using offset with a usize casted to an isize instead of add.

Either arrayref needs to fix that upstream, or we need to stop using arrayref.

tarcieri avatar Aug 29 '21 17:08 tarcieri

Removing WIP/draft cc @oconnor663

This now passes all of the original tests from blake2 v0.9 and in theory should be a fully API compatible replacement.

There's one remaining item on my checklist, which is to enable tests on thumbv7em-none-eabi.

After that though, I'd say this is in good enough shape to merge. There are a couple other lingering items I'd like to address, like making the size of the Blake2* types smaller by using a better implementation of reset, but I feel like those can be addressed in small follow-up PRs rather than adding them onto this long-lived one.

tarcieri avatar Aug 29 '21 17:08 tarcieri

As of 2f45194 I have fixed the build warnings on no_std targets by gating all of the parallel implementations to only work on x86/x86_64 platforms for now.

On the blake2b_simd/blake2s_simd crates, it would appear to cause a runtime panic if you attempted to use these features on non-x86 targets.

I think it would probably be a good idea to make it possible to use those implementations on all targets, even if the underlying implementation weren't parallel, but don't see that as a blocker for this PR.

I'm going to go ahead and call this PR done barring any additional feedback. All of the checkboxes in the original description are now checked.

Please review when you have some time @newpavlov @oconnor663

tarcieri avatar Aug 29 '21 18:08 tarcieri

Either arrayref needs to fix that upstream, or we need to stop using arrayref.

All the blake2_simd code predates .try_into() array conversions in the standard library. If we want to clean all that up, and we don't have an ancient MSRV requirement, we can probably just use standard conversions now.

On the blake2b_simd/blake2s_simd crates, it would appear to cause a runtime panic if you attempted to use these features on non-x86 targets.

Could you say more here? How can I repro this panic? I notice when I try something like cross test --target thumbv7em-none-eabi --no-default-features, that the build fails because it seems like #[test] isn't implemented, which I guess makes sense. But where does the runtime panic happen?

oconnor663 avatar Aug 30 '21 17:08 oconnor663

But yes, thank you for doing all this work! I'll need to find a block of time to sit down and do it justice, probably tomorrow.

oconnor663 avatar Aug 30 '21 17:08 oconnor663

But where does the runtime panic happen?

I was referring to this panic:

https://github.com/RustCrypto/hashes/pull/228#pullrequestreview-583526150

However I fixed the issue in 2f45194713e5c9beb56191ca65c3bd0cec91e6cb by gating all of the parallel implementations to be x86-only for now.

If we want to clean all that up, and we don't have an ancient MSRV requirement, we can probably just use standard conversions now.

That sounds great, although probably something I'd want to tackle in a followup PR.

tarcieri avatar Aug 30 '21 17:08 tarcieri

Commented above about the panic. If you haven't actually seen the panic in practice (which would prove me wrong for sure), then I believe it's prevented by checks in compress_many.

oconnor663 avatar Aug 31 '21 15:08 oconnor663

I think based on https://github.com/RustCrypto/hashes/commit/2f45194713e5c9beb56191ca65c3bd0cec91e6cb the codepaths were dead/unreachable. The gating I did in that commit was necessary to eliminate all of the warnings on thumbv7em-none-eabi (since we check -Dwarnings in CI)

tarcieri avatar Aug 31 '21 15:08 tarcieri

Hmm, I think we should be able to keep the portable implementation of BLAKE2bp/sp in on those platforms without issuing warnings. What do you think of this approach: https://github.com/oconnor663/blake2_simd/commit/1c1c23d1d62e273e7cf303be405c30278077ef5b (which I've pushed to this branch https://github.com/oconnor663/blake2_simd/tree/nonx86_warnings )

oconnor663 avatar Aug 31 '21 15:08 oconnor663

Looks quite similar to 2f45194, no?

tarcieri avatar Aug 31 '21 15:08 tarcieri

If I understand correctly, the difference is that I've left the blake2bp module in place on portable targets, but https://github.com/RustCrypto/hashes/commit/2f45194713e5c9beb56191ca65c3bd0cec91e6cb removes that module entirely?

oconnor663 avatar Aug 31 '21 15:08 oconnor663

Aah, interesting. I was still getting warnings having done similar changes to yours.

tarcieri avatar Aug 31 '21 15:08 tarcieri

The grossest thing I had to do there was suppress a couple "variable doesn't need to be mut" warnings, because those variables only need to be mut in the x86 case. Hopefully that's not too sinful? :)

oconnor663 avatar Aug 31 '21 15:08 oconnor663

I can try reverting https://github.com/RustCrypto/hashes/commit/2f45194713e5c9beb56191ca65c3bd0cec91e6cb and stepping through it again

tarcieri avatar Aug 31 '21 15:08 tarcieri

By the way, looking back at this code has reminded me (or maybe made me appreciate for the first time) how intensely complicated the Jobs/hash_many infrastructure is. It's designed to support scenarios that almost no one will ever encounter (hash a collection of inputs of unequal lengths in parallel), and it's probably not worth the complexity of maintaining it. Maybe a future refactoring could rip a lot of this stuff out, and leave only the parts specifically needed for BLAKE2bp/sp? (E.g. Stride::Parallel)

If that does sound like a direction we'd like to go in the future, it might make sense to make mod many private as part of this port. That way removing it down the road wouldn't be a backwards-incompatible change. What do you think?

oconnor663 avatar Aug 31 '21 15:08 oconnor663

Sounds good to me. I'm all for reducing API surface.

tarcieri avatar Aug 31 '21 15:08 tarcieri

Apologies for the delay in getting to this. I notice that we have Blake2b, VarBlake2b, and State in the same module, with overlapping purposes. Similarly we have Hash and Params, but those are only used by State, and perhaps confusingly the Blake2b::with_params constructor doesn't involve Params. What do you think the long-term strategy will be with these API details?

oconnor663 avatar Sep 05 '21 22:09 oconnor663

Pie-in-the-sky, this has me wondering about an alternative strategy: What if we separated out the blake2_simd backends into a couple standalone crates (e.g. blake2b_backend and blake2s_backend) that provided a very low-level API, like the raw compression function and a minimal parallel compression function for BLAKE2bp/sp? That could be maintained here in the hashes repo, and then depended on by all the user-facing crates. That might let us kick the can down the road on API questions. How ridiculous would this be?

oconnor663 avatar Sep 05 '21 22:09 oconnor663

I notice that we have Blake2b, VarBlake2b, and State in the same module, with overlapping purposes. Similarly we have Hash and Params, but those are only used by State, and perhaps confusingly the Blake2b::with_params constructor doesn't involve Params. What do you think the long-term strategy will be with these API details?

There's definitely a lot of work and consolidation to be done there.

One thing that can definitely go away is Blake2* vs VarBlake2*, which is already addressed in #217.

From there the rest is largely up in the air. I'd probably suggest getting rid of State or at least making it non-pub. Hash should probably go away in favor of digest::Output and crypto_mac::Output (which would also eliminate the arrayvec dependency)

What if we separated out theblake2_simd backends into a couple standalone crates (e.g. blake2b_backend and blake2s_backend) that provided a very low-level API, like the raw compression function and a minimal parallel compression function for BLAKE2bp/sp? That could be maintained here in the hashes repo, and then depended on by all the user-facing crates. That might let us kick the can down the road on API questions. How ridiculous would this be?

Personally I'm still in favor of trying to bring everything under one roof. There is a lot of duplication right now.

The current blake2 crate does this using macros, which isn't great. An alternative to that which is worth considering IMO is writing the internals more using (sealed) traits/generics, and then wrapping that up in a higher-level API which hides the internal implementation strategy.

I don't think it's a good idea to explore that sort of thing as part of this PR, but merging everything into a single crate at least leaves the door open to such refactorings.

tarcieri avatar Sep 07 '21 13:09 tarcieri