Using vector capabilities of the CPU for SHA-256 in SSZ merkleization of lists
In discussion with @potuz, it emerged that there is scope for exploiting SIMD-enabled processors. The use case is SSZ merkleization of lists, for which @potuz has reported a roughly 10x improvement:
goos: linux
goarch: amd64
cpu: AMD Ryzen 5 3600 6-Core Processor
BenchmarkHashBalanceShani-12 160 7629704 ns/op
BenchmarkHashBalanceShaniPrysm-12 15 74012328 ns/op
PASS
goos: linux
goarch: amd64
cpu: Intel(R) Core(TM) i5-3570 CPU @ 3.40GHz
BenchmarkHashBalanceAVX-4 68 26677965 ns/op
BenchmarkHashBalancePrysm-4 7 165434686 ns/op
PASS
goos: linux
goarch: amd64
cpu: Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz
BenchmarkHashBalanceAVX2-4 121 9711482 ns/op
BenchmarkHashBalancePrysm-4 10 103716714 ns/op
PASS
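To make the "10x" figure concrete, the speedup implied by each run above can be worked out from the ns/op columns. The following is only a small arithmetic sketch: the numbers are copied from the benchmark output, while the dictionary labels and the `speedup` helper are names made up for this example.

```python
# ns/op figures copied from the benchmark output above:
# (vectorized implementation, production Prysm baseline)
runs = {
    "Ryzen 5 3600 (Shani)": (7_629_704, 74_012_328),
    "i5-3570 (AVX)": (26_677_965, 165_434_686),
    "i5-7200U (AVX2)": (9_711_482, 103_716_714),
}


def speedup(vectorized_ns: int, baseline_ns: int) -> float:
    """How many times faster the vectorized run is than the baseline."""
    return baseline_ns / vectorized_ns


speedups = {cpu: speedup(v, b) for cpu, (v, b) in runs.items()}

for cpu, s in speedups.items():
    print(f"{cpu}: {s:.1f}x")
```

This works out to roughly 9.7x on the Ryzen, 6.2x on the i5-3570 and 10.7x on the i5-7200U, so "10x" holds for two of the three machines and the older AVX-only CPU gains somewhat less.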
Reference Links: https://github.com/potuz/mammon/blob/main/ssz/sha256_avx2.asm#L635-L659 https://github.com/potuz/mammon/blob/main/ssz/hasher.hpp#L27
Based on this, I dug further and found that AssemblyScript has support for SIMD vector processing: https://v8.dev/features/simd There are two ways this can be done:
- Via compiler flags that auto-vectorize loops within a single digest
- Via AssemblyScript wrapper functions that vectorize the computation by processing multiple digests in parallel (the approach followed by @potuz in his reference implementation; more optimal wherever a multi-digest, SIMD-compatible workload is available)
Task:
- [ ] Investigate and get familiar with the SIMD support directives in AssemblyScript
- [ ] Investigate and, if possible, develop loop parallelization for SIMD
- [ ] Investigate and develop multiple digest feeds
- [ ] Integrate the multi-digest support into SSZ merkleization of lists
Oh damn! :rocket: How can I check if my host supports SIMD?
cpuid gives you this. A C++ call is here https://github.com/potuz/mammon/blob/main/ssz/hasher.cpp#L43-L55
Thank you!
$ cpuid
CPU 0:
...
feature information (1/edx):
...
SSE extensions = true
SSE2 extensions = true
feature information (1/ecx):
...
SSE4.1 extensions = true
SSE4.2 extensions = true
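On Linux the same information can also be read without the `cpuid` tool, since the kernel lists CPU feature flags in `/proc/cpuinfo`. The sketch below assumes an x86 Linux host (ARM kernels use a `Features` line instead, which is not handled here); `parse_cpu_flags` and `simd_support` are hypothetical helper names, and the flag spellings are the ones the kernel uses on x86.

```python
from pathlib import Path

# Flags of interest, spelled the way Linux reports them on x86.
SIMD_FLAGS = ("sse2", "ssse3", "sse4_1", "sse4_2", "avx", "avx2", "sha_ni")


def parse_cpu_flags(cpuinfo_text: str) -> set:
    """Return the flag set of the first CPU in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def simd_support(cpuinfo_text: str) -> dict:
    """Map each flag in SIMD_FLAGS to whether the CPU reports it."""
    flags = parse_cpu_flags(cpuinfo_text)
    return {name: name in flags for name in SIMD_FLAGS}


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")
    if path.exists():
        for name, present in simd_support(path.read_text()).items():
            print(f"{name}: {present}")
    else:
        print("/proc/cpuinfo not found; this sketch only covers Linux")
```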
:heart_eyes: how common is support on modern CPUs?
I'm making the case to implement this in Prysm expecting at least SSE3, which has been standard since 2004/5. I don't expect there is a single CPU without SSE3 actually staking. In practical terms, I don't think there's a single one without AVX. This is for Intel; I haven't looked into ARM assembly yet.
That's huge then! Would love to see this in Lodestar.
I did some comparisons with Lighthouse on our hashing throughput, and somehow Lodestar is 5x slower when benchmarking hashing a full state, yet performance is the same when benchmarking hashing a single 64-byte value. It would be worth researching this further to get the most out of this improvement @g11tech
https://github.com/ChainSafe/lodestar/issues/2206
This requires changes both in the assembly, to return buffers with all roots at the same time, and in the hashing logic, to call the whole block at once instead of pairwise leaves. I put a stupid implementation in the design document; surely it can be improved, but it is already giving those 10x benches against production Prysm on large lists: https://hackmd.io/@potuz/BJyrx9DOF
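To illustrate the change in hashing logic described here: instead of one hash call per pair of sibling chunks, each tree layer is passed to a single batch function that returns all parent roots in one buffer. This is only a minimal sketch, with `hashlib` standing in for a vectorized backend. `hash_layer`, `merkle_root` and `merkle_root_pairwise` are hypothetical names, not the API of Prysm or hashtree, and the input is assumed to be a power-of-two number of 32-byte chunks (no zero-padding or length mix-in).

```python
import hashlib

CHUNK = 32  # bytes per Merkle tree node


def hash_layer(layer: bytes) -> bytes:
    """Hash every 64-byte sibling pair and return all parents in one buffer.

    A vectorized backend would process several pairs at a time; hashlib
    simply loops, so this shows the call shape but not the speedup.
    """
    assert len(layer) % (2 * CHUNK) == 0
    out = bytearray()
    for i in range(0, len(layer), 2 * CHUNK):
        out += hashlib.sha256(layer[i : i + 2 * CHUNK]).digest()
    return bytes(out)


def merkle_root(chunks: bytes) -> bytes:
    """Reduce the chunks to a root with one batch call per tree layer."""
    count = len(chunks) // CHUNK
    assert len(chunks) % CHUNK == 0 and count > 0
    assert count & (count - 1) == 0  # power of two assumed in this sketch
    layer = chunks
    while len(layer) > CHUNK:
        layer = hash_layer(layer)
    return layer


def merkle_root_pairwise(chunks: bytes) -> bytes:
    """Reference: same tree, but one hash call per pair of siblings."""
    nodes = [chunks[i : i + CHUNK] for i in range(0, len(chunks), CHUNK)]
    while len(nodes) > 1:
        nodes = [
            hashlib.sha256(nodes[i] + nodes[i + 1]).digest()
            for i in range(0, len(nodes), 2)
        ]
    return nodes[0]
```

Both functions produce the same root; the difference is that the layer-wise version makes one call per layer (log2 of the chunk count) rather than one per pair, which is what gives a multi-digest backend a whole block to work on.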
We'll most probably be using https://github.com/prysmaticlabs/hashtree. It's in the very early stages of development, but I'd be happy to see some benchmarks from Lodestar if you could test it. If you decide to use it, I'll be happy to provide bindings or whatever you need.