bc-csharp
bc-csharp copied to clipboard
Add Blake2b intrinsics POC
Adapted @saucecontrol's Blake2Fast intrinsics code to work with the Konscious.Security.Cryptography library and thought it would work here.
The benchmark measured a 6x-13x speed up without additional memory usage. Blake2s or Blake2xs would probably experience similar benefits.
Non intrinsics
| Method | Size | Mean | Error | StdDev | Median | Gen0 | Allocated |
|---|---|---|---|---|---|---|---|
| GetHash | 3 | 2.270 μs | 0.0804 μs | 0.2372 μs | 2.337 μs | 0.1106 | 176 B |
| GetHash | 100 | 2.082 μs | 0.0241 μs | 0.0214 μs | 2.083 μs | 0.1106 | 176 B |
| GetHash | 1000 | 15.085 μs | 0.8234 μs | 2.4279 μs | 15.903 μs | 0.0916 | 176 B |
| GetHash | 10000 | 155.035 μs | 2.8955 μs | 4.1527 μs | 155.360 μs | - | 176 B |
Intrinsics
| Method | Size | Mean | Error | StdDev | Median | Gen0 | Allocated |
|---|---|---|---|---|---|---|---|
| GetHash | 3 | 356.7 ns | 11.23 ns | 33.12 ns | 367.6 ns | 0.1121 | 176 B |
| GetHash | 100 | 341.0 ns | 6.96 ns | 20.09 ns | 343.6 ns | 0.1121 | 176 B |
| GetHash | 1000 | 1,439.0 ns | 27.08 ns | 22.61 ns | 1,434.5 ns | 0.1106 | 176 B |
| GetHash | 10000 | 12,796.6 ns | 259.71 ns | 765.76 ns | 13,144.9 ns | 0.1068 | 176 B |
Note that I'm new to intrinsics and CompilerServices.Unsafe, this shouldn't be considered complete or safe. Do you have a CI pipeline for testing intrinsic code?
Great numbers, but suggest the following changes:
- Move all intrinsics code to a separate file (Blake2b_X86).
- Put intrinsics code under "#if NETCOREAPP3_0_OR_GREATER", which AFAIK is the correct minimal version.
- Add a licensing comment if the code is a substantial copy of either of the mentioned projects.
- Add IsSupported method to the new class and use that from Blake2bDigest
- Extract round function into a method (AggressiveInlining is still fine)
- Probably can replace most of the Unsafe methods with MemoryMarshal methods
At least add the license comment, the rest I can take care of if necessary.
- [x] Move all intrinsics code to a separate file (Blake2b_X86).
- [x] Put intrinsics code under "#if NETCOREAPP3_0_OR_GREATER", which AFAIK is the correct minimal version.
- [x] Add a licensing comment if the code is a substantial copy of either of the mentioned projects.
- [ ] Add IsSupported method to the new class and use that from Blake2bDigest
- [ ] Extract round function into a method (AggressiveInlining is still fine)
- [x] Probably can replace most of the Unsafe methods with MemoryMarshal methods
Added Blake2s_X86.
Blake2s
Non intrinsics
| Method | Size | Mean | Error | StdDev | Median | Gen0 | Allocated |
|---|---|---|---|---|---|---|---|
| GetHash | 3 | 1.872 μs | 0.0935 μs | 0.2758 μs | 1.987 μs | 0.1450 | 232 B |
| GetHash | 100 | 3.686 μs | 0.0737 μs | 0.1523 μs | 3.720 μs | 0.1450 | 232 B |
| GetHash | 1000 | 28.101 μs | 0.5315 μs | 1.0857 μs | 28.234 μs | 0.1221 | 232 B |
| GetHash | 10000 | 204.365 μs | 9.7001 μs | 28.6010 μs | 200.846 μs | - | 232 B |
Intrinsics
| Method | Size | Mean | Error | StdDev | Gen0 | Allocated |
|---|---|---|---|---|---|---|
| GetHash | 3 | 347.4 ns | 13.51 ns | 39.85 ns | 0.1478 | 232 B |
| GetHash | 100 | 501.3 ns | 12.88 ns | 37.99 ns | 0.1478 | 232 B |
| GetHash | 1000 | 2,545.4 ns | 38.75 ns | 34.35 ns | 0.1450 | 232 B |
| GetHash | 10000 | 23,501.2 ns | 465.85 ns | 653.06 ns | 0.1221 | 232 B |
Blake2xs
Non intrinsics
| Method | Size | Mean | Error | StdDev | Gen0 | Allocated |
|---|---|---|---|---|---|---|
| GetHash | 3 | 2.306 μs | 0.0315 μs | 0.0295 μs | 0.5684 | 896 B |
| GetHash | 100 | 3.201 μs | 0.0413 μs | 0.0366 μs | 0.5684 | 896 B |
| GetHash | 1000 | 15.635 μs | 0.1960 μs | 0.1834 μs | 0.5493 | 896 B |
| GetHash | 10000 | 135.593 μs | 1.6090 μs | 1.5051 μs | 0.4883 | 896 B |
Intrinsics
| Method | Size | Mean | Error | StdDev | Median | Gen0 | Allocated |
|---|---|---|---|---|---|---|---|
| GetHash | 3 | 1.072 μs | 0.0193 μs | 0.0198 μs | 1.070 μs | 0.5703 | 896 B |
| GetHash | 100 | 1.091 μs | 0.0307 μs | 0.0906 μs | 1.109 μs | 0.5703 | 896 B |
| GetHash | 1000 | 2.858 μs | 0.1685 μs | 0.4969 μs | 3.035 μs | 0.5684 | 896 B |
| GetHash | 10000 | 16.398 μs | 0.1364 μs | 0.1675 μs | 16.444 μs | 0.5493 | 896 B |
There was no overlap between Blake2b and Blake2s so I deleted VectorExtensions and inlined the methods.
Suggestions
Add a length check to DoFinal to ensure that output spans and arrays are long enough.
There are several repeat implementations of Store128, Load128, Store128_UInt32, perhaps a shared helper method would help.
Note that intrinsics aren't available for NETSTANDARD2_1_OR_GREATER, so those directives should be removed (just NETCOREAPP3_0_OR_GREATER).
I'd still like methods on the _X86 classes, e.g.: public static bool IsSupported => Avx2.IsSupported && BitConverter.IsLittleEndian;
although ideally BitConverter.IsLittleEndian isn't a requirement and only affects Load/Store (this could fall out later when I factor out all the load/store methods and support bigendian).
Also if all the rounds have the same structure (or a small set thereof), please factor out a method for 1 round (using ref parameters where each round applies to different parts of the state).
I've added length checks that you suggested also (this mirror can take a few days to be updated though).
- [x] Note that intrinsics aren't available for NETSTANDARD2_1_OR_GREATER, so those directives should be removed (just NETCOREAPP3_0_OR_GREATER).
- [x] I'd still like methods on the _X86 classes, e.g.:
public static bool IsSupported => Avx2.IsSupported && BitConverter.IsLittleEndian;
Also if all the rounds have the same structure (or a small set thereof), please factor out a method for 1 round (using ref parameters where each round applies to different parts of the state).
I've created a Round function containing the diagonalization and G calls, The remaining logic is loading the b0-b4 values, as far as I can tell this can't be fruther simplified as each round is subtly different.
I've added length checks that you suggested also (this mirror can take a few days to be updated though).
Thanks, it was driving me nuts when debugging the benchmarks
Should I update the xml documentation? It looks like the docs haven't been updated since being converted from java. Are you okay with using <inheritdoc />?
/// <inheritdoc />
public void Update(byte b)
{
I've converted the javadocs to xml form. I've used <inheritdoc /> where possible and added summaries when the existing javadoc differed from the IDigest description.
Nitpick: I've noticed that a couple of functions such as Update(byte b) have different parameter names to the interface IDigest Update(byte input). I haven't altered the names as this would be a breaking change, but it might be worth updating them to match the interface.
Is this ready for review/merge?
I'll hold off for and convert this to a draft. Want to double check I haven't added a regression for .Net 3