NaCl.Core icon indicating copy to clipboard operation
NaCl.Core copied to clipboard

Add ChaCha and Salsa intrinsics.

Open TimothyMakkison opened this issue 3 years ago • 11 comments

Added @macaba intrinsics implementation #58 for ChaCha20. Salsa20 has since been added to NaCl, so I created Salsa20BaseIntrinsics based on macaba's work. At minimum both got a 4x speed up.

  • Added ChaCha20 & Salsa20.
  • Added benchmarks for Salsa20 and XSalsa20.
  • Added checks for intrinsics support and failovers.
  • Added VariableLengthCiphers tests, to ensure that the intrinsics methods work when using length specific methods.

This pr will require a way of testing with and without intrinsics, I'm not sure how to do this but I think COMPlus_EnableAVX is related?

A future update could speed up ProcessKeyStreamblock by using the ProcessStream method.

private static byte[] _zeros = new byte[BLOCK_SIZE_IN_BYTES];

public override void ProcessKeyStreamBlock(ReadOnlySpan<byte> nonce, int counter, Span<byte> block)
{

    if (block.Length != BLOCK_SIZE_IN_BYTES)
        throw new CryptographicException($"The key stream block length is not valid. The length in bytes must be {BLOCK_SIZE_IN_BYTES}.");
#if INTRINSICS
    if (Sse2.IsSupported)
    {
        ProcessStream(nonce, block, _zeros, counter);
        return;
    }
...

TimothyMakkison avatar Oct 11 '22 17:10 TimothyMakkison

Codecov Report

Merging #128 (d7d04d4) into master (ed3bf52) will decrease coverage by 10.75%. The diff coverage is 89.41%.

@@             Coverage Diff             @@
##           master     #128       +/-   ##
===========================================
- Coverage   99.78%   89.02%   -10.76%     
===========================================
  Files          14       26       +12     
  Lines         455     1421      +966     
  Branches       57       97       +40     
===========================================
+ Hits          454     1265      +811     
- Misses          1      146      +145     
- Partials        0       10       +10     
Impacted Files Coverage Δ
src/NaCl.Core/Base/ChaChaCore/ChaCha20Core.cs 0.00% <0.00%> (ø)
src/NaCl.Core/Base/SalsaCore/Salsa20Core.cs 0.00% <0.00%> (ø)
src/NaCl.Core/Base/Snuffle.cs 57.14% <ø> (-40.00%) :arrow_down:
src/NaCl.Core/ChaCha20.cs 100.00% <ø> (ø)
src/NaCl.Core/Salsa20.cs 100.00% <ø> (ø)
src/NaCl.Core/XChaCha20.cs 100.00% <ø> (ø)
src/NaCl.Core/XSalsa20.cs 100.00% <ø> (ø)
src/NaCl.Core/Base/ChaCha20Base.cs 73.46% <62.50%> (-26.54%) :arrow_down:
src/NaCl.Core/Base/Salsa20Base.cs 57.77% <62.50%> (-42.23%) :arrow_down:
src/NaCl.Core/Base/ChaCha20BaseIntrinsics.cs 77.77% <77.77%> (ø)
... and 13 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

codecov-commenter avatar Oct 12 '22 00:10 codecov-commenter

Thanks for the contribution and for adding Salsa20 intrinsics too! I wonder if we can reduce the method complexity of those intrinsics ones by breaking them apart. Afaik, the initial code was grabbed from another repository which I currently don't recall.

There's also XSalsa20 over there but I haven't promoted those to stable since I felt it needed more tests. The target was to also support XSalsa20Poly1305.

Regarding testing of intrinsics turned on and off, yes, it was something around COMPlus_EnableAVX, I'm not sure since I looked at this a while ago. For the CI, we might need to add tests for running with and without intrinsics I guess.

Btw, do you have some benchmark numbers you can share to see the perf increase?

I might need to take a look at why the Azure CI failed with Windows only.

daviddesmet avatar Oct 12 '22 00:10 daviddesmet

I wonder if we can reduce the method complexity of those intrinsics ones by breaking them apart.

I've split Intrinsics up by chunk size and altered ChaCha64 and Salsa64 to add HChaCha/HSalsa support. Aside from that I haven't changed the other methods. I guess I could extract some of the Xor/Store/Unpack methods and give them nicer names. I'll add ProcessKeyStreamIntrinsics, the changes to the 64 chunk methods should make it painless.

There's also XSalsa20 over there but I haven't promoted those to stable since I felt it needed more tests. The target was to also support XSalsa20Poly1305.

Regarding testing of intrinsics turned on and off, yes, it was something around COMPlus_EnableAVX, I'm not sure since I looked at this a while ago. For the CI, we might need to add tests for running with and without intrinsics I guess.

I'll take a look, I've never used env variables in unit tests before, hopefully Google will help.

Btw, do you have some benchmark numbers you can share to see the perf increase?

My numbers are from some benchmarks I did yesterday. Sorry about the quality, my laptops about to die 😅

Salsa

image

ChaCha

image

TimothyMakkison avatar Oct 12 '22 22:10 TimothyMakkison

I've split Intrinsics up by chunk size and altered ChaCha64 and Salsa64 to add HChaCha/HSalsa support. Aside from that I haven't changed the other methods. I guess I could extract some of the Xor/Store/Unpack methods and give them nicer names. I'll add ProcessKeyStreamIntrinsics, the changes to the 64 chunk methods should make it painless.

Sounds good, thanks a lot for the effort you are putting into this.

I'll take a look, I've never used env variables in unit tests before, hopefully Google will help.

If you run the benchmarks from Powershell or cmd via dotnet run, you can set the variable using set VARIABLE=VALUE before the dotnet command.

My numbers are from some benchmarks I did yesterday. Sorry about the quality, my laptops about to die 😅

Salsa

image

ChaCha

image

Thanks for sharing those, the numbers look nice :)

daviddesmet avatar Oct 13 '22 00:10 daviddesmet

I've experimented with a refactor that treats ChaCha20Base and Salsa20Base as a proxy for the internal ...Core and ...CoreItrinsics methods. It removes the intrinsics checks and preprocessor spam but arguably makes the code harder to read, I don't think I changed any externally visible apis.

I'm planning on creating some general tests for intrinsics code and relying on the CI with different flags enabled to do a thorough check. Ideally I'd configure XUnit to run the tests/test classes again with different flags but it doesn't seem supported 🤷

TimothyMakkison avatar Oct 16 '22 23:10 TimothyMakkison

@TimothyMakkison my apologies for the late follow-up, it has been hectic over here.

Thanks a ton for pulling this big task off, I added a few comments to the PR; most of them are minor stuff.

I also noticed a PowerShell script but I didn't see it being called from DevOps, is it something you intended to do? The build defined in this project uses Cake, we could set the environment variable in there afaik; I will need to investigate further into that. Perhaps we can use that and just define tests that should be running with specific environment variables without the need to update the GitHub actions.

Last but not least, I wonder if you could work on the coverage a little bit more since we dropped quite a lot with this PR, you can check codecov from above.

daviddesmet avatar Nov 08 '22 01:11 daviddesmet

Thanks for the review. I got sidetracked and forgot about this PR so it's unfinished.

I wonder if we do need the preprocessor directive since we are checking if the hardware supports it

System.Runtime.Intrinsics.X86 is only supported for .NET 3.0+. Intrinsics code won't compile for libraries that multi target versions unless we use a preprocessor.

Could you make this access modifier implicit?

Not sure what you mean by implicit, do you mean private?

We could switch to scoped namespace declaration here since we already using it on the other files =)

Should I convert all the files in the solution to scoped namespaces?

I also noticed a PowerShell script but I didn't see it being called from DevOps, is it something you intended to do? The build defined in this project uses Cake, we could set the environment variable in there afaik;

Sorry I was using the benchmark code and powershell script for testing and forgot to remove them. I'll modify the cake file.

Do you have an opinion on the refactor?

The logic up to 202ff15 was fairly simple; the intrinsics code was added with pre processing and an IsSupported check. This has the advantage of not touching the api, keeping the code/code flow the same and adding minimal additional code and files.

Commit 4d68791 separates the intrinsic and non intrinsic code from SalsaBase/ChaChaBase creating individual files for them, SalsaBase/ChaChaBase act as a facade calling the internal IChaChaCore interface for all methods. This has the disadvantage of adding a lot of code, altering a lot of code to keep the api the same, duplicating code and arguably being overly complex.

I'll probably undo the refactor unless you prefer it, I thinks it's too complicated, harder to maintain, and isn't needed if the CI will test SIMD and non SIMD versions. Should also improve the code coverage and reduce duplication.

TimothyMakkison avatar Nov 08 '22 18:11 TimothyMakkison

Thanks for the review. I got sidetracked and forgot about this PR so it's unfinished.

No worries, I got a pile of work and also lost track about your PRs, sorry about that!

I wonder if we do need the preprocessor directive since we are checking if the hardware supports it

System.Runtime.Intrinsics.X86 is only supported for .NET 3.0+. Intrinsics code won't compile for libraries that multi target versions unless we use a preprocessor.

I think we can drop .NET 3 and just stick with the supported versions going forward. Supporting obsolete versions of .NET just adds a lot of maintainability headaches.

Could you make this access modifier implicit?

Not sure what you mean by implicit, do you mean private?

Yes, I mean adding to those the private modifier.

We could switch to scoped namespace declaration here since we already using it on the other files =)

Should I convert all the files in the solution to scoped namespaces?

I would opt to have consistency in all files whenever we can.

I also noticed a PowerShell script but I didn't see it being called from DevOps, is it something you intended to do? The build defined in this project uses Cake, we could set the environment variable in there afaik;

Sorry I was using the benchmark code and powershell script for testing and forgot to remove them. I'll modify the cake file.

No worries, it was just a question since I didn't know the purpose of it, you could leave that one if you want.

Do you have an opinion on the refactor?

The logic up to 202ff15 was fairly simple; the intrinsics code was added with pre processing and an IsSupported check. This has the advantage of not touching the api, keeping the code/code flow the same and adding minimal additional code and files.

Commit 4d68791 separates the intrinsic and non intrinsic code from SalsaBase/ChaChaBase creating individual files for them, SalsaBase/ChaChaBase act as a facade calling the internal IChaChaCore interface for all methods. This has the disadvantage of adding a lot of code, altering a lot of code to keep the api the same, duplicating code and arguably being overly complex.

I'll probably undo the refactor unless you prefer it, I thinks it's too complicated, harder to maintain, and isn't needed if the CI will test SIMD and non SIMD versions. Should also improve the code coverage and reduce duplication.

Yeah, I agree with that. It added a lot of maintainability complexity, if we limit the scope of the supported .NET versions we can stick with the first logic you proposed.

daviddesmet avatar Nov 09 '22 00:11 daviddesmet

Yeah, I agree with that. It added a lot of maintainability complexity, if we limit the scope of the supported .NET versions we can stick with the first logic you proposed. I think we can drop .NET 3 and just stick with the supported versions going forward. Supporting obsolete versions of .NET just adds a lot of maintainability headaches.

The "refactor" was added to make the code testable, with the CI and cake file this thankfully isn't needed. This pr is compatible with all the different versoins of .NET it just requires a couple of pre processor checks.

  • Updated the cake file, NET 7 RC 2 has broken my sdk so I can't install the cake tool and test it. Hopefully it works 🤞
  • Reverted the changes to Salsa20Base and ChaCha20Base.
  • Deleted intrinsic/scalar tests.
  • Optimized the shuffle function for Salsa64, this used a different instruction set so I updated the guard checks.
  • To reduce the scope of this PR I haven't updated all the namespaces. I'm tempted to create a code cleanup pr changing the namespaces, exception parameters, empty arrays and unused fields etc.

Do you think it would be worth getting someone who knows intrinsics to double check this? This was my first time using intrinisics/unsafe code in this way.

TimothyMakkison avatar Nov 09 '22 15:11 TimothyMakkison

  • To reduce the scope of this PR I haven't updated all the namespaces. I'm tempted to create a code cleanup pr changing the namespaces, exception parameters, empty arrays and unused fields etc.

Yeah, sounds great!

Do you think it would be worth getting someone who knows intrinsics to double check this? This was my first time using intrinisics/unsafe code in this way.

That would be nice but I really don't know anyone from my side. :/

daviddesmet avatar Nov 17 '22 00:11 daviddesmet

I had the cunning idea of submitting a pr to the bouncycastle project. Give them a performance bump for large cipherstreams and have another set of eyes look at the implementations. Once I get a grasp of IDigest I'll make a pr

TimothyMakkison avatar Nov 17 '22 00:11 TimothyMakkison