nimbus-eth2
nimbus-eth2 copied to clipboard
Investigate faster hash implementations
@potuz found some nice cherries on the hash optimization tree: https://hackmd.io/@potuz/rJX9iD30u
Recall that while we cannot update the DIGEST for one round of the algorithm without computing the previous round, we can in fact compute the scheduled words without knowing the updated DIGEST. This is particularly useful in the Merkle tree computation: we have an entire block of 64 bytes consisting of the padding block, that we know before hand. So the corresponding scheduled words SCHEDB for each round of the algorithm can be hardcoded since they are known before compile time!. The overal time savings from this idea ammounted to 40% of the total hashing time in the benchmark below.
This implementation detail is used in at least one of bitcoind’s hashing algorithms, but surprisingly seems to be missing on the existing Ethereum clients algorithms.
The interesting bit that we can potentially reuse is the hash_64b_blocks
function defined here:
https://github.com/potuz/mammon/blob/main/ssz/sha256.asm
The interesting bit that we can potentially reuse is the
hash_64b_blocks
function defined here: https://github.com/potuz/mammon/blob/main/ssz/sha256.asm
That assembly is only for SHA-ni capable CPUs, if there's interest in a general library implementation I could cook up some AVX2 and pure SSE versions. I wouldn't know how to do (or test) the ARM versions. It is also possible to implement them with C-intrinsics, I do not know if it's easy to import this to Nim. I do not know if you use your own implementation or a third-party library, but if it's the latter I'd raise an issue with them.
I do not know if you use your own implementation or a third-party library, but if it's the latter I'd raise an issue with them.
I can imagine the ideal place for this code would be a patch to blst where the platform-specific code already resides - I can imagine that's where we'd want to do any hardware detection as well - given the low penetration of these instructions, supporting multiple versions and choosing the right one at runtime seems like the right thing to do
FWIW, I posted SSE, AVX, AVX2 and SHA modifications to Intel's assembly. If you raise an issue with your SHA provider you may want to link to my repo if they want to take the constants from there. Here are my benchmarks running a generic release build
Nimbus | Prysm | Lighthouse (single thread) | Mammon | |
---|---|---|---|---|
NUC i5-8259U @ 2.30GHz | 1112ms (AVX2) | 1125ms (AVX2) | 860ms (AVX2) | 563 (AVX2) 596(AVX) 1042 (SSE) |
Ryzen 5 @3.60 GHz | 251ms (SHA) | 292ms (SHA) | 760ms (SHA) | 161ms (SHA) 760ms (AVX2) 813ms (AVX) 1361 ms (SSE) |
XPS 13 i5-7200U @ 2.50GHz | 953 ms (AVX2) | 1085ms (AVX2) | 998ms (AVX2) | 654 ms (AVX2) 681ms(AVX) 964ms (SSE) |
It's true that it's Intel's assembly running on a Ryzen, but I am not buying AMD again.
I can imagine the ideal place for this code would be a patch to blst where the platform-specific code already resides - I can imagine that's where we'd want to do any hardware detection as well - given the low penetration of these instructions, supporting multiple versions and choosing the right one at runtime seems like the right thing to do
choosing the right implementation at runtime is just a one liner call to cpuid, here's mammon's implementation of that (I'm only supporting reasonable CPU's though) https://github.com/potuz/mammon/blob/main/ssz/hasher.cpp#L46
somewhat old:ish paper on intel sha256 implementations: https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/sha-256-implementations-paper.pdf
cc @etan-status
more from @potuz : https://hackmd.io/@potuz/BJyrx9DOF
more from @potuz : https://hackmd.io/@potuz/BJyrx9DOF
For hardcoded padding, is it something like precomputing sha256(0x0000...0000) like done here for hash to curve?
https://github.com/supranational/blst/blob/0fdf02e9d220e970d574163bfaf8591c38923ca2/src/hash_to_field.c#L22-L35
more from @potuz : https://hackmd.io/@potuz/BJyrx9DOF
For hardcoded padding, is it something like precomputing sha256(0x0000...0000) like done here for hash to curve?
https://github.com/supranational/blst/blob/0fdf02e9d220e970d574163bfaf8591c38923ca2/src/hash_to_field.c#L22-L35
nope, the hardcoded padding block refers to the fact that you can have the scheduled words hardcoded in the assembly. The relevant part of the assembly code is here https://github.com/potuz/mammon/blob/main/ssz/sha256_avx2.asm#L730-L738
And a little explanation of the phenomenon is in the document linked in the first description of this issue.
We'll be using https://github.com/prysmaticlabs/hashtree at prysm, it's still too early to adopt that library, but I'd be very happy to see some benchmarks from Nimbus if you test it.
- https://github.com/status-im/nimbus-eth2/pull/5188
- https://github.com/status-im/nimbus-eth2/pull/6262
- https://github.com/status-im/nimbus-eth2/pull/6278
- https://github.com/status-im/nimbus-eth2/pull/6292
v24.6.0 shipped hashtree.