Using vector capabilities of the CPU for SHA-256 in SSZ merkleization of lists

In discussion with @potuz, we found that there is scope for using the capabilities of SIMD-enabled processors. The use case is SSZ merkleization of lists, for which @potuz has reported a roughly 10x improvement in the benchmarks below.

goos: linux
goarch: amd64
cpu: AMD Ryzen 5 3600 6-Core Processor
BenchmarkHashBalanceShani-12                  160       7629704 ns/op
BenchmarkHashBalanceShaniPrysm-12              15      74012328 ns/op
PASS

goos: linux
goarch: amd64
cpu: Intel(R) Core(TM) i5-3570 CPU @ 3.40GHz
BenchmarkHashBalanceAVX-4               68      26677965 ns/op
BenchmarkHashBalancePrysm-4              7     165434686 ns/op
PASS

goos: linux
goarch: amd64
cpu: Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz
BenchmarkHashBalanceAVX2-4             121       9711482 ns/op
BenchmarkHashBalancePrysm-4             10     103716714 ns/op
PASS
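
To put a number on each run, the ns/op figures above can be divided pairwise. This is only a small arithmetic sketch; the dictionary keys are shorthand labels made up here from the CPU and benchmark names, and the figures are copied from the output above.

```python
# ns/op figures copied from the benchmark output above:
# (vectorized run, Prysm baseline) for each machine.
benchmarks = {
    "Ryzen 5 3600 (Shani)": (7_629_704, 74_012_328),
    "i5-3570 (AVX)": (26_677_965, 165_434_686),
    "i5-7200U (AVX2)": (9_711_482, 103_716_714),
}


def speedup(vectorized_ns: int, baseline_ns: int) -> float:
    """How many times faster the vectorized run is than the baseline."""
    return baseline_ns / vectorized_ns


speedups = {cpu: speedup(*pair) for cpu, pair in benchmarks.items()}

for cpu, ratio in speedups.items():
    print(f"{cpu}: {ratio:.1f}x")
```

The ratios come out at roughly 9.7x, 6.2x and 10.7x, so the "10x" figure holds for the Shani and AVX2 runs, while the older AVX-only machine gains about 6x.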

Reference Links: https://github.com/potuz/mammon/blob/main/ssz/sha256_avx2.asm#L635-L659 https://github.com/potuz/mammon/blob/main/ssz/hasher.hpp#L27

Based on this, I dug in and found that AssemblyScript supports SIMD vector processing through WebAssembly SIMD: https://v8.dev/features/simd There are two ways this can be done:

1. Via compiler flags that auto-vectorize the loops inside a single digest computation
2. Via AssemblyScript wrapper functions that vectorize the computation across multiple digests processed in parallel (the approach @potuz follows in his reference implementation; more efficient wherever a multi-digest, SIMD-compatible workload is available)
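
The difference between the two options is mostly one of interface. A rough sketch in Python, assuming hypothetical names `hash_one` and `hash_many` (neither comes from Lodestar, Prysm or mammon): the single-digest call takes one 64-byte block, while the multi-digest call takes many blocks in one buffer and returns all digests in one buffer. Here `hashlib` is only a scalar stand-in, so nothing in this sketch is actually vectorized.

```python
import hashlib

BLOCK = 64   # two 32-byte chunks, concatenated
DIGEST = 32  # one SHA-256 output


def hash_one(block: bytes) -> bytes:
    """Single-digest shape: one 64-byte block in, one 32-byte digest out."""
    if len(block) != BLOCK:
        raise ValueError("expected exactly one 64-byte block")
    return hashlib.sha256(block).digest()


def hash_many(blocks: bytes) -> bytes:
    """Multi-digest shape: N concatenated blocks in, N concatenated digests out.

    Stand-in only: this loops over hashlib one block at a time. A real
    SIMD kernel would process several independent blocks per instruction.
    """
    if len(blocks) % BLOCK != 0:
        raise ValueError("input must be a whole number of 64-byte blocks")
    count = len(blocks) // BLOCK
    out = bytearray(count * DIGEST)
    for i in range(count):
        block = blocks[i * BLOCK:(i + 1) * BLOCK]
        out[i * DIGEST:(i + 1) * DIGEST] = hash_one(block)
    return bytes(out)
```

With the second shape the caller hands over all independent blocks at once, which is what gives a vectorized kernel something to parallelize.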

Task:

[ ] Investigate and get familiar with the SIMD support directives in AssemblyScript
[ ] Investigate and, if possible, develop loop parallelization for SIMD
[ ] Investigate and develop multiple digest feeds
[ ] Integrate the multi-digest support into SSZ merkleization of lists

Nov 22 '21 08:11 g11tech

Oh damn! :rocket: How can I check if my host supports SIMD?

Nov 23 '21 09:11 dapplion

cpuid gives you this. A C++ call is here https://github.com/potuz/mammon/blob/main/ssz/hasher.cpp#L43-L55
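
On x86 Linux the same information is also exposed as text in /proc/cpuinfo, so a quick check is possible without the cpuid tool. A minimal sketch, assuming the usual Linux flag spellings (`pni` for SSE3, `sha_ni` for the SHA extensions); the function names are made up for illustration, and ARM hosts list a `Features` line instead, which this does not handle.

```python
from pathlib import Path
from typing import Dict, Set

# Flag names as x86 Linux prints them in /proc/cpuinfo ("pni" means SSE3).
SIMD_FLAGS = {
    "SSE3": "pni",
    "SSSE3": "ssse3",
    "AVX": "avx",
    "AVX2": "avx2",
    "SHA-NI": "sha_ni",
}


def parse_cpu_flags(cpuinfo_text: str) -> Set[str]:
    """Return the flag set from the first 'flags' line, or an empty set."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def simd_support(flags: Set[str]) -> Dict[str, bool]:
    """Map each extension of interest to whether its flag is present."""
    return {name: flag in flags for name, flag in SIMD_FLAGS.items()}


def host_simd_support(path: str = "/proc/cpuinfo") -> Dict[str, bool]:
    """Read the host's cpuinfo file and report the extensions it lists."""
    return simd_support(parse_cpu_flags(Path(path).read_text()))
```

Calling `host_simd_support()` on an x86 Linux machine returns a dictionary such as `{"SSE3": True, ..., "SHA-NI": False}`, depending on the CPU.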

Nov 23 '21 09:11 potuz

Thank you!

$ cpuid
CPU 0:
...
   feature information (1/edx):
...
      SSE extensions                         = true
      SSE2 extensions                        = true
   feature information (1/ecx):
...
      SSE4.1 extensions                       = true
      SSE4.2 extensions                       = true

:heart_eyes: how common is support on modern CPUs?

Nov 23 '21 09:11 dapplion

I'm making the case to implement this in Prysm, expecting at least SSE3, which has been standard since 2004/5. I don't expect a single CPU without SSE3 to actually be staking. In practical terms, I don't think there's a single one without AVX. This is speaking for Intel; I haven't looked into ARM assembly yet.

Nov 23 '21 10:11 potuz

That's huge then! Would love to see this in Lodestar.

I compared our hashing throughput with Lighthouse's, and somehow Lodestar is 5x slower when benchmarking the hashing of a full state, yet performance is the same when benchmarking the hashing of a single 64-byte value. It would be worth researching this further to get the most out of this improvement, @g11tech

https://github.com/ChainSafe/lodestar/issues/2206

Nov 23 '21 10:11 dapplion

This requires two changes: the assembly has to return a buffer with all the roots at once, and the hashing logic has to pass a whole block of leaves in one call instead of hashing leaves pairwise. I put a naive implementation in the design document; surely it can be improved, but it is already giving those 10x benchmarks against production Prysm on large lists: https://hackmd.io/@potuz/BJyrx9DOF
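
The change in hashing logic can be sketched as follows. This is not the implementation from the design document, just an illustration under simplifying assumptions: a power-of-two number of 32-byte chunks, no SSZ zero-padding or length mix-in, and a hypothetical `hash_layer` that uses `hashlib` as a scalar stand-in for the multi-digest kernel. The point is the call pattern: one batch call per tree layer, compared with one call per pair of nodes.

```python
import hashlib


def hash_layer(layer: bytes) -> bytes:
    """Hash every 64-byte pair in `layer`; return all parents in one buffer.

    Scalar stand-in: a SIMD kernel would hash several pairs at once.
    """
    return b"".join(
        hashlib.sha256(layer[i:i + 64]).digest()
        for i in range(0, len(layer), 64)
    )


def merkle_root_by_layer(chunks: bytes) -> bytes:
    """Root of a power-of-two number of 32-byte chunks, one call per layer."""
    count = len(chunks) // 32
    if len(chunks) % 32 or count == 0 or count & (count - 1):
        raise ValueError("expected a power-of-two number of 32-byte chunks")
    layer = chunks
    while len(layer) > 32:
        layer = hash_layer(layer)  # the whole layer goes in as one block
    return layer


def merkle_root_pairwise(chunks: bytes) -> bytes:
    """Reference: the same root, computed with one hash call per node pair."""
    if len(chunks) == 32:
        return chunks
    half = len(chunks) // 2
    left = merkle_root_pairwise(chunks[:half])
    right = merkle_root_pairwise(chunks[half:])
    return hashlib.sha256(left + right).digest()
```

Both functions produce the same root; the layer-wise version just makes far fewer, much larger calls into the hasher, which is the shape a multi-digest kernel needs.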

Nov 23 '21 10:11 potuz

We'll most probably be using https://github.com/prysmaticlabs/hashtree. It's in the very early stages of development, but I'd be happy to see some benchmarks from Lodestar if you could test it. If you decide to use it, I'll be happy to provide bindings or whatever you need.

Jan 03 '22 19:01 potuz
