nimbus-eth2 Investigate faster hash implementations

@potuz found some nice cherries on the hash optimization tree: https://hackmd.io/@potuz/rJX9iD30u

Recall that while we cannot update the DIGEST for one round of the algorithm without computing the previous round, we can in fact compute the scheduled words without knowing the updated DIGEST. This is particularly useful in the Merkle tree computation: we have an entire block of 64 bytes consisting of the padding block, that we know before hand. So the corresponding scheduled words SCHEDB for each round of the algorithm can be hardcoded since they are known before compile time!. The overal time savings from this idea ammounted to 40% of the total hashing time in the benchmark below.

This implementation detail is used in at least one of bitcoind’s hashing algorithms, but surprisingly seems to be missing on the existing Ethereum clients algorithms.

Jul 27 '21 09:07 arnetheduck

The interesting bit that we can potentially reuse is the hash_64b_blocks function defined here: https://github.com/potuz/mammon/blob/main/ssz/sha256.asm

Jul 27 '21 12:07 zah

The interesting bit that we can potentially reuse is the hash_64b_blocks function defined here: https://github.com/potuz/mammon/blob/main/ssz/sha256.asm

That assembly is only for SHA-ni capable CPUs, if there's interest in a general library implementation I could cook up some AVX2 and pure SSE versions. I wouldn't know how to do (or test) the ARM versions. It is also possible to implement them with C-intrinsics, I do not know if it's easy to import this to Nim. I do not know if you use your own implementation or a third-party library, but if it's the latter I'd raise an issue with them.

Jul 27 '21 12:07 potuz

I do not know if you use your own implementation or a third-party library, but if it's the latter I'd raise an issue with them.

I can imagine the ideal place for this code would be a patch to blst where the platform-specific code already resides - I can imagine that's where we'd want to do any hardware detection as well - given the low penetration of these instructions, supporting multiple versions and choosing the right one at runtime seems like the right thing to do

Jul 28 '21 06:07 arnetheduck

FWIW, I posted SSE, AVX, AVX2 and SHA modifications to Intel's assembly. If you raise an issue with your SHA provider you may want to link to my repo if they want to take the constants from there. Here are my benchmarks running a generic release build

	Nimbus	Prysm	Lighthouse (single thread)	Mammon
NUC i5-8259U @ 2.30GHz	1112ms (AVX2)	1125ms (AVX2)	860ms (AVX2)	563 (AVX2) 596(AVX) 1042 (SSE)
Ryzen 5 @3.60 GHz	251ms (SHA)	292ms (SHA)	760ms (SHA)	161ms (SHA) 760ms (AVX2) 813ms (AVX) 1361 ms (SSE)
XPS 13 i5-7200U @ 2.50GHz	953 ms (AVX2)	1085ms (AVX2)	998ms (AVX2)	654 ms (AVX2) 681ms(AVX) 964ms (SSE)

It's true that it's Intel's assembly running on a Ryzen, but I am not buying AMD again.

I can imagine the ideal place for this code would be a patch to blst where the platform-specific code already resides - I can imagine that's where we'd want to do any hardware detection as well - given the low penetration of these instructions, supporting multiple versions and choosing the right one at runtime seems like the right thing to do

choosing the right implementation at runtime is just a one liner call to cpuid, here's mammon's implementation of that (I'm only supporting reasonable CPU's though) https://github.com/potuz/mammon/blob/main/ssz/hasher.cpp#L46

Aug 03 '21 00:08 potuz

somewhat old:ish paper on intel sha256 implementations: https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/sha-256-implementations-paper.pdf

Aug 18 '21 07:08 arnetheduck

cc @etan-status

Oct 21 '21 16:10 mratsim

more from @potuz : https://hackmd.io/@potuz/BJyrx9DOF

Nov 21 '21 09:11 arnetheduck

more from @potuz : https://hackmd.io/@potuz/BJyrx9DOF

For hardcoded padding, is it something like precomputing sha256(0x0000...0000) like done here for hash to curve?

https://github.com/supranational/blst/blob/0fdf02e9d220e970d574163bfaf8591c38923ca2/src/hash_to_field.c#L22-L35

Nov 22 '21 13:11 mratsim

more from @potuz : https://hackmd.io/@potuz/BJyrx9DOF

For hardcoded padding, is it something like precomputing sha256(0x0000...0000) like done here for hash to curve?

https://github.com/supranational/blst/blob/0fdf02e9d220e970d574163bfaf8591c38923ca2/src/hash_to_field.c#L22-L35

nope, the hardcoded padding block refers to the fact that you can have the scheduled words hardcoded in the assembly. The relevant part of the assembly code is here https://github.com/potuz/mammon/blob/main/ssz/sha256_avx2.asm#L730-L738

And a little explanation of the phenomenon is in the document linked in the first description of this issue.

Nov 22 '21 14:11 potuz

We'll be using https://github.com/prysmaticlabs/hashtree at prysm, it's still too early to adopt that library, but I'd be very happy to see some benchmarks from Nimbus if you test it.

Jan 03 '22 19:01 potuz

https://github.com/status-im/nimbus-eth2/pull/5188
https://github.com/status-im/nimbus-eth2/pull/6262
https://github.com/status-im/nimbus-eth2/pull/6278
https://github.com/status-im/nimbus-eth2/pull/6292

Jun 06 '24 02:06 tersec

v24.6.0 shipped hashtree.

Jul 04 '24 06:07 tersec

nimbus-eth2 nimbus-eth2 copied to clipboard

Investigate faster hash implementations

nimbus-eth2
nimbus-eth2 copied to clipboard