hashes
hashes copied to clipboard
blake2: integrate `blake2b_simd`/`blake2s_simd` crates
Integrates original sources from these crates, which provide AVX2-accelerated SIMD backends:
https://github.com/oconnor663/blake2_simd
Taken from this commit:
- Hash: 7bf791e67245bb84132d1ee0e6a893bb8c85c093
- Author: Jack O'Connor [email protected]
- Date: Fri Nov 13 15:50:16 2020 -0500
- Title: AES-CTR benchmarks
Next steps
- [x] Determine if we want to proceed with a combined
blake2
crate or if we can get theblake2b
crate and useblake2b
/blake2s
- [x] Define
Blake2b
/Blake2s
structs which provide impls of thedigest
traits. These can also be used to replace theblake2::blake2b::blake2b
/blake2::blake2s::blake2s
static functions e.g. withBlake2b::digest(...)
- [x] Also define
VarBlake2b
/VarBlake2s
(?) - [x] Address compile errors/panics on
no_std
targets - [x] Figure out solution for handling KATs
- [x] ~~Address clippy errors (presently TODOs) in the PR~~ (caused by
arrayref
, can circle back on that)
As discussed in #88, this PR merges the blake2b_simd
and blake2s_simd
crates into the blake2
crate, placing each algorithm under blake2b
and blake2s
modules respectively, and promoting the blake2bp
and blake2sp
modules to the toplevel of the crate.
Alternatively it might be possible to retain the two-crate structure, however unfortunately while we own the blake2s
crate, we don't own the blake2b
crate. I have contacted its current owner (as presently all releases are yanked so it's unused) to see if we can obtain it, in which case we can do a more straightforward import of blake2b_simd
=> blake2b
and blake2s_simd
=> blake2s
.
~~This PR is presently about 170 klocs(!) owing to the vendored C implementations in blake2/benches/
. We should probably decide if we want to retain all that code or figure out a different solution. It'd be nice if there were *-sys
crates of the C code sourced from elsewhere which we could use for comparison instead.~~ (Edit: removed the benches for now. We can follow up on them separately)
Also note: I haven't tried to impl any of the digest
traits. I'd like to sort out whether we'd like to proceed with a combined single blake2
crate with submodules for blake2b
and blake2s
or if we can get the blake2b
crate first.
@newpavlov any thoughts on this?
Oof, the KATs are enormous:
$ ls -lh tests/blake2-kat.json
-rw-r--r-- 1 bascule staff 1.8M Feb 4 05:15 tests/blake2-kat.json
It seems like https://github.com/oconnor663/blake2_simd was able to keep them conveniently out-of-band by virtue of having a workspace and keeping it outside the crates themselves. ~~We should definitely do something similar to keep the crate size small.~~
Edit: moved them to .kat/blake2-kat.json
in the toplevel of the repo. This unfortunately means that these tests will fail if anyone tries to test the packaged crate. Perhaps we could have e.g. a special cfg
directive (e.g. --cfg kat
) to enable these tests so they're ignored when testing from a released crate?
The good news: it compiles and all of the original tests are now passing! 🎉
I'll update the toplevel description with some next steps.
@oconnor663 @newpavlov so it turns out I was able to obtain ownership of the blake2b
crate after all.
So the question is: should we proceed with a combined blake2
crate as this PR currently implements, or deprecate the blake2
crate and split it into blake2b
and blake2s
like blake2_simd
did?
Assuming that amount of code reuse between b
and s
modules is minimal, I am fine with the crate split. But it could be worth to postpone it until digest v0.10 release.
Oof, the KATs are enormous:
This is why we have blobby
.
Assuming that amount of code reuse between
b
ands
modules is minimal...
When I was making this PR I did notice considerable overlap/duplication between the two crates. There are large swaths of the code that are completely identical. For example, the file many.rs
is nearly identical, with some small changes to constants (e.g. 2/4/8), the blake2b
vs blake2s
names, and some small differences in comments.
I think there's quite a bit of opportunity to extract a common
module shared between them which eliminates such duplication.
This is why we have
blobby
.
I did a quick ballparking exercise, ignoring all framing and just decoding and concatenating the hex strings of the 1.8MB blake2-kat.json
file.
This still results in 827kB of binary data, which I would consider enormous and probably a bad idea to include in a crate that would otherwise be ~60kB otherwise.
But it could be worth to postpone it until digest v0.10 release.
I definitely think we should land #217 first and do another release of the current blake2
crate before moving forward with this.
Yes there is quite a lot of duplicated code between 2b and 2s. I experimented a couple times with refactoring it to share more, but I wound up getting frustrated with having everything in a giant macro, and I gave up and decided copy-pasting was a better result (even as a maintainer, but especially for anyone who has to read the code). Some scattered thoughts:
- There are some quirky differences in the way the parameter blocks are laid out between 2b and 2s, particularly the "node offset", which needs different code and not just different constants. This sort of thing is annoying to model in a shared macro.
- I don't think it's possible to share SIMD code between 2b and 2s. Of course they use different intrinsics, though maybe that could be abstracted out. But more fundamentally, the word loading in the single-block compression functions is all optimized by hand (originally by Samuel upstream), and the two versions don't look similar.
- In contrast, most of the code that might be easier to share is the relatively straightforward parts like buffer management and the public parameters API. I'm not thrilled to maintain two copies of this code, but so far I've preferred to do that rather than make the SIMD code any harder to reason about.
- The
many.rs
code is in between. It's duplicated, but it's definitely tricky code that's not easy to reason about. That might be the highest-value opportunity for finding a way to share it. But on the other hand, again, the fact that it's so tricky makes me even less excited about making it a big macro.
Maybe there are some code sharing strategies besides "giant macro" that could work here, that I haven't thought about? Maybe some interesting possibilities open up when we can use const generics?
Another scattered thought: The many
APIs were super valuable to me back when I was working on BLAKE3 prototypes. Now that BLAKE3 is its own thing and doesn't depend on this code, it might be that the only existing callers of the many
APIs are the BLAKE2bp and BLAKE2sp implementations also in these crates. Having a totally generic "hash many" capability is neat, but it might be that in practice no one is ever going to use it. It's possible that we could simplify the API, if it was just a private implementation detail of BLAKE2bp and BLAKE2sp. Something to think about, if many
ends up getting in the way of code sharing.
@oconnor663 Thank you for your comments!
In digest
v0.10 we have buffer management included behind a feature (see the core_api
module). So implementation crates only have to define state initialization, blocks compression, and finalization parts.
One possible alternative to giant macros is to be generic over a private trait with necessary associated constants and types. Since Rust monomorphizes generics by default, it should be identical to the macro approach performance-wise. I have successfully refactored some macros using this approach (though const generics would have been a better fit in some cases), so it may apply here as well.
I'm weakly in favor of merging the crates for the following reasons:
- It will provide the simplest transition path for existing users of the
blake2
crate - It leaves the door open for DRYing out some of the duplicated code
I think the core implementation of update_many
could be converted into a sealed trait as @newpavlov suggested, with the constant value differences hoisted into associated consts. That could probably be made easier/more futureproof by making the many
modules private and re-exporting the functions that comprise the public APIs from the blake2b
/blake2s
modules.
As of fe77461, this PR implements what is effectively the sum of the blake2
and blake2b_simd
+blake2s_simd
APIs. I'm not removing WIP/draft yet because I want to add the full set of tests from the blake2
crate first, as the original blake2
-like API only has basic smoke tests via rustdoc.
That said, it now impls blake2
9.x-compatible structs which impl traits from the crypto-mac
and digest
crates, in addition to providing the entire original API of blake2b_simd
and blake2s_simd
.
There are a few types it might be worth initially unifying, like replacing blake2b::Hash
/blake2s::Hash
with digest::Output
and thereby dropping the arrayvec
dependency, but other than that I think it makes sense to at least initially retain the full set of APIs to smooth a transition. As things stand the current implementation is largely a drop-in replacement for all three crates, which I think is pretty neat.
Took a look at the clippy::ptr_offset_with_cast
warningss... they all originate in arrayref
, which is using offset
with a usize
casted to an isize
instead of add
.
Either arrayref
needs to fix that upstream, or we need to stop using arrayref
.
Removing WIP/draft cc @oconnor663
This now passes all of the original tests from blake2
v0.9 and in theory should be a fully API compatible replacement.
There's one remaining item on my checklist, which is to enable tests on thumbv7em-none-eabi
.
After that though, I'd say this is in good enough shape to merge. There are a couple other lingering items I'd like to address, like making the size of the Blake2*
types smaller by using a better implementation of reset
, but I feel like those can be addressed in small follow-up PRs rather than adding them onto this long-lived one.
As of 2f45194 I have fixed the build warnings on no_std
targets by gating all of the parallel implementations to only work on x86
/x86_64
platforms for now.
On the blake2b_simd
/blake2s_simd
crates, it would appear to cause a runtime panic if you attempted to use these features on non-x86 targets.
I think it would probably be a good idea to make it possible to use those implementations on all targets, even if the underlying implementation weren't parallel, but don't see that as a blocker for this PR.
I'm going to go ahead and call this PR done barring any additional feedback. All of the checkboxes in the original description are now checked.
Please review when you have some time @newpavlov @oconnor663
Either arrayref needs to fix that upstream, or we need to stop using arrayref.
All the blake2_simd code predates .try_into()
array conversions in the standard library. If we want to clean all that up, and we don't have an ancient MSRV requirement, we can probably just use standard conversions now.
On the blake2b_simd/blake2s_simd crates, it would appear to cause a runtime panic if you attempted to use these features on non-x86 targets.
Could you say more here? How can I repro this panic? I notice when I try something like cross test --target thumbv7em-none-eabi --no-default-features
, that the build fails because it seems like #[test]
isn't implemented, which I guess makes sense. But where does the runtime panic happen?
But yes, thank you for doing all this work! I'll need to find a block of time to sit down and do it justice, probably tomorrow.
But where does the runtime panic happen?
I was referring to this panic:
https://github.com/RustCrypto/hashes/pull/228#pullrequestreview-583526150
However I fixed the issue in 2f45194713e5c9beb56191ca65c3bd0cec91e6cb by gating all of the parallel implementations to be x86-only for now.
If we want to clean all that up, and we don't have an ancient MSRV requirement, we can probably just use standard conversions now.
That sounds great, although probably something I'd want to tackle in a followup PR.
Commented above about the panic. If you haven't actually seen the panic in practice (which would prove me wrong for sure), then I believe it's prevented by checks in compress_many
.
I think based on https://github.com/RustCrypto/hashes/commit/2f45194713e5c9beb56191ca65c3bd0cec91e6cb the codepaths were dead/unreachable. The gating I did in that commit was necessary to eliminate all of the warnings on thumbv7em-none-eabi
(since we check -Dwarnings
in CI)
Hmm, I think we should be able to keep the portable implementation of BLAKE2bp/sp in on those platforms without issuing warnings. What do you think of this approach: https://github.com/oconnor663/blake2_simd/commit/1c1c23d1d62e273e7cf303be405c30278077ef5b (which I've pushed to this branch https://github.com/oconnor663/blake2_simd/tree/nonx86_warnings )
Looks quite similar to 2f45194, no?
If I understand correctly, the difference is that I've left the blake2bp
module in place on portable targets, but https://github.com/RustCrypto/hashes/commit/2f45194713e5c9beb56191ca65c3bd0cec91e6cb removes that module entirely?
Aah, interesting. I was still getting warnings having done similar changes to yours.
The grossest thing I had to do there was suppress a couple "variable doesn't need to be mut
" warnings, because those variables only need to be mut
in the x86 case. Hopefully that's not too sinful? :)
I can try reverting https://github.com/RustCrypto/hashes/commit/2f45194713e5c9beb56191ca65c3bd0cec91e6cb and stepping through it again
By the way, looking back at this code has reminded me (or maybe made me appreciate for the first time) how intensely complicated the Jobs
/hash_many
infrastructure is. It's designed to support scenarios that almost no one will ever encounter (hash a collection of inputs of unequal lengths in parallel), and it's probably not worth the complexity of maintaining it. Maybe a future refactoring could rip a lot of this stuff out, and leave only the parts specifically needed for BLAKE2bp/sp? (E.g. Stride::Parallel
)
If that does sound like a direction we'd like to go in the future, it might make sense to make mod many
private as part of this port. That way removing it down the road wouldn't be a backwards-incompatible change. What do you think?
Sounds good to me. I'm all for reducing API surface.
Apologies for the delay in getting to this. I notice that we have Blake2b
, VarBlake2b
, and State
in the same module, with overlapping purposes. Similarly we have Hash
and Params
, but those are only used by State
, and perhaps confusingly the Blake2b::with_params
constructor doesn't involve Params
. What do you think the long-term strategy will be with these API details?
Pie-in-the-sky, this has me wondering about an alternative strategy: What if we separated out the blake2_simd
backends into a couple standalone crates (e.g. blake2b_backend
and blake2s_backend
) that provided a very low-level API, like the raw compression function and a minimal parallel compression function for BLAKE2bp/sp? That could be maintained here in the hashes
repo, and then depended on by all the user-facing crates. That might let us kick the can down the road on API questions. How ridiculous would this be?
I notice that we have Blake2b, VarBlake2b, and State in the same module, with overlapping purposes. Similarly we have Hash and Params, but those are only used by State, and perhaps confusingly the Blake2b::with_params constructor doesn't involve Params. What do you think the long-term strategy will be with these API details?
There's definitely a lot of work and consolidation to be done there.
One thing that can definitely go away is Blake2*
vs VarBlake2*
, which is already addressed in #217.
From there the rest is largely up in the air. I'd probably suggest getting rid of State
or at least making it non-pub
. Hash
should probably go away in favor of digest::Output
and crypto_mac::Output
(which would also eliminate the arrayvec
dependency)
What if we separated out the
blake2_simd
backends into a couple standalone crates (e.g.blake2b_backend
andblake2s_backend
) that provided a very low-level API, like the raw compression function and a minimal parallel compression function for BLAKE2bp/sp? That could be maintained here in the hashes repo, and then depended on by all the user-facing crates. That might let us kick the can down the road on API questions. How ridiculous would this be?
Personally I'm still in favor of trying to bring everything under one roof. There is a lot of duplication right now.
The current blake2
crate does this using macros, which isn't great. An alternative to that which is worth considering IMO is writing the internals more using (sealed) traits/generics, and then wrapping that up in a higher-level API which hides the internal implementation strategy.
I don't think it's a good idea to explore that sort of thing as part of this PR, but merging everything into a single crate at least leaves the door open to such refactorings.