nimbus-eth2 icon indicating copy to clipboard operation
nimbus-eth2 copied to clipboard

Investigate faster hash implementations

Open arnetheduck opened this issue 3 years ago • 11 comments

@potuz found some nice cherries on the hash optimization tree: https://hackmd.io/@potuz/rJX9iD30u

Recall that while we cannot update the DIGEST for one round of the algorithm without computing the previous round, we can in fact compute the scheduled words without knowing the updated DIGEST. This is particularly useful in the Merkle tree computation: we have an entire block of 64 bytes consisting of the padding block, that we know before hand. So the corresponding scheduled words SCHEDB for each round of the algorithm can be hardcoded since they are known before compile time!. The overal time savings from this idea ammounted to 40% of the total hashing time in the benchmark below.

This implementation detail is used in at least one of bitcoind’s hashing algorithms, but surprisingly seems to be missing on the existing Ethereum clients algorithms.

arnetheduck avatar Jul 27 '21 09:07 arnetheduck

The interesting bit that we can potentially reuse is the hash_64b_blocks function defined here: https://github.com/potuz/mammon/blob/main/ssz/sha256.asm

zah avatar Jul 27 '21 12:07 zah

The interesting bit that we can potentially reuse is the hash_64b_blocks function defined here: https://github.com/potuz/mammon/blob/main/ssz/sha256.asm

That assembly is only for SHA-ni capable CPUs, if there's interest in a general library implementation I could cook up some AVX2 and pure SSE versions. I wouldn't know how to do (or test) the ARM versions. It is also possible to implement them with C-intrinsics, I do not know if it's easy to import this to Nim. I do not know if you use your own implementation or a third-party library, but if it's the latter I'd raise an issue with them.

potuz avatar Jul 27 '21 12:07 potuz

I do not know if you use your own implementation or a third-party library, but if it's the latter I'd raise an issue with them.

I can imagine the ideal place for this code would be a patch to blst where the platform-specific code already resides - I can imagine that's where we'd want to do any hardware detection as well - given the low penetration of these instructions, supporting multiple versions and choosing the right one at runtime seems like the right thing to do

arnetheduck avatar Jul 28 '21 06:07 arnetheduck

FWIW, I posted SSE, AVX, AVX2 and SHA modifications to Intel's assembly. If you raise an issue with your SHA provider you may want to link to my repo if they want to take the constants from there. Here are my benchmarks running a generic release build

Nimbus Prysm Lighthouse (single thread) Mammon
NUC i5-8259U @ 2.30GHz 1112ms (AVX2) 1125ms (AVX2) 860ms (AVX2) 563 (AVX2) 596(AVX) 1042 (SSE)
Ryzen 5 @3.60 GHz 251ms (SHA) 292ms (SHA) 760ms (SHA) 161ms (SHA) 760ms (AVX2) 813ms (AVX) 1361 ms (SSE)
XPS 13 i5-7200U @ 2.50GHz 953 ms (AVX2) 1085ms (AVX2) 998ms (AVX2) 654 ms (AVX2) 681ms(AVX) 964ms (SSE)

It's true that it's Intel's assembly running on a Ryzen, but I am not buying AMD again.

I can imagine the ideal place for this code would be a patch to blst where the platform-specific code already resides - I can imagine that's where we'd want to do any hardware detection as well - given the low penetration of these instructions, supporting multiple versions and choosing the right one at runtime seems like the right thing to do

choosing the right implementation at runtime is just a one liner call to cpuid, here's mammon's implementation of that (I'm only supporting reasonable CPU's though) https://github.com/potuz/mammon/blob/main/ssz/hasher.cpp#L46

potuz avatar Aug 03 '21 00:08 potuz

somewhat old:ish paper on intel sha256 implementations: https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/sha-256-implementations-paper.pdf

arnetheduck avatar Aug 18 '21 07:08 arnetheduck

cc @etan-status

mratsim avatar Oct 21 '21 16:10 mratsim

more from @potuz : https://hackmd.io/@potuz/BJyrx9DOF

arnetheduck avatar Nov 21 '21 09:11 arnetheduck

more from @potuz : https://hackmd.io/@potuz/BJyrx9DOF

For hardcoded padding, is it something like precomputing sha256(0x0000...0000) like done here for hash to curve?

https://github.com/supranational/blst/blob/0fdf02e9d220e970d574163bfaf8591c38923ca2/src/hash_to_field.c#L22-L35

mratsim avatar Nov 22 '21 13:11 mratsim

more from @potuz : https://hackmd.io/@potuz/BJyrx9DOF

For hardcoded padding, is it something like precomputing sha256(0x0000...0000) like done here for hash to curve?

https://github.com/supranational/blst/blob/0fdf02e9d220e970d574163bfaf8591c38923ca2/src/hash_to_field.c#L22-L35

nope, the hardcoded padding block refers to the fact that you can have the scheduled words hardcoded in the assembly. The relevant part of the assembly code is here https://github.com/potuz/mammon/blob/main/ssz/sha256_avx2.asm#L730-L738

And a little explanation of the phenomenon is in the document linked in the first description of this issue.

potuz avatar Nov 22 '21 14:11 potuz

We'll be using https://github.com/prysmaticlabs/hashtree at prysm, it's still too early to adopt that library, but I'd be very happy to see some benchmarks from Nimbus if you test it.

potuz avatar Jan 03 '22 19:01 potuz

  • https://github.com/status-im/nimbus-eth2/pull/5188
  • https://github.com/status-im/nimbus-eth2/pull/6262
  • https://github.com/status-im/nimbus-eth2/pull/6278
  • https://github.com/status-im/nimbus-eth2/pull/6292

tersec avatar Jun 06 '24 02:06 tersec

v24.6.0 shipped hashtree.

tersec avatar Jul 04 '24 06:07 tersec