all: use SHA256 with SIMD instructions for higher performance and throughput
In this repository, we heavily use the Go standard library's crypto/sha256. However, there exists a Single Instruction Multiple Data (SIMD) package from our friends at MinIO, https://github.com/minio/sha256-simd, which promises 8x speedups when using AVX instructions. We should explore this.
Let's check whether performance radically improves and, if so, plumb it in.
Kindly cc-ing my colleague @elias-orijtech
For Admin Use
- [ ] Not duplicate issue
- [ ] Appropriate labels applied
- [ ] Appropriate contributors tagged
- [ ] Contributor assigned/self-assigned
Is it okay to assign this to you and your team, @odeke-em?
Yes, please @marbar3778! We are working on it. I just need to find a machine with AVX512 so that we can produce benchmarks.
In support of using that library! Though I think it's probably advisable to turn off AVX-512 via a build flag, given the SDK workload (https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-use-these-new-instructions/).
+1
Also of interest for this issue are a number of occurrences of crypto.Sha256():
(screenshot: search results showing occurrences of crypto.Sha256)

These come from what appears to be a helper function that wraps crypto/sha256:
https://github.com/cometbft/cometbft/blob/e9b91405b643b46b011865c4b7e1c1af0aa5c521/crypto/hash.go#L7-L11
We'd probably want to either replace these usages or update cometbft to use the SIMD library as well.
Thanks for the insight. I would advocate for replacing the wrapped function, as we are trying to rely less on comet.
The last time I checked, I didn't see much improvement on the dev machines I have at hand (an x86_64 Mac laptop and an arm64 Linux box); on the Mac the stdlib is actually much faster. I just reran the benchmark with go1.20, with the results below:
arm64 linux

```
~/sha256-simd $ go test -run=^$ -bench=. -benchmem ./ -count=1
goos: linux
goarch: arm64
pkg: github.com/minio/sha256-simd
BenchmarkHash/Generic/8Bytes-8    2184978     549.6 ns/op     14.56 MB/s   0 B/op   0 allocs/op
BenchmarkHash/Generic/64Bytes-8   1000000      1064 ns/op     60.17 MB/s   0 B/op   0 allocs/op
BenchmarkHash/Generic/1K-8         139132      8623 ns/op    118.76 MB/s   0 B/op   0 allocs/op
BenchmarkHash/Generic/8K-8          18447     65101 ns/op    125.83 MB/s   0 B/op   0 allocs/op
BenchmarkHash/Generic/1M-8            144   8288227 ns/op    126.51 MB/s   0 B/op   0 allocs/op
BenchmarkHash/Generic/5M-8             28  41402281 ns/op    126.63 MB/s   3 B/op   0 allocs/op
BenchmarkHash/Generic/10M-8            14  82817517 ns/op    126.61 MB/s   0 B/op   0 allocs/op
BenchmarkHash/ArmSha2/8Bytes-8   11930301     100.6 ns/op     79.55 MB/s   0 B/op   0 allocs/op
BenchmarkHash/ArmSha2/64Bytes-8   7533750     160.1 ns/op    399.67 MB/s   0 B/op   0 allocs/op
BenchmarkHash/ArmSha2/1K-8        1547152     775.6 ns/op   1320.21 MB/s   0 B/op   0 allocs/op
BenchmarkHash/ArmSha2/8K-8         224019      5354 ns/op   1530.03 MB/s   0 B/op   0 allocs/op
BenchmarkHash/ArmSha2/1M-8           1789    670705 ns/op   1563.39 MB/s   0 B/op   0 allocs/op
BenchmarkHash/ArmSha2/5M-8            356   3352908 ns/op   1563.68 MB/s   0 B/op   0 allocs/op
BenchmarkHash/ArmSha2/10M-8           178   6706550 ns/op   1563.51 MB/s   0 B/op   0 allocs/op
BenchmarkHash/GoStdlib/8Bytes-8  11268408     106.6 ns/op     75.04 MB/s   0 B/op   0 allocs/op
BenchmarkHash/GoStdlib/64Bytes-8  8466012     141.9 ns/op    450.98 MB/s   0 B/op   0 allocs/op
BenchmarkHash/GoStdlib/1K-8       1586331     756.2 ns/op   1354.14 MB/s   0 B/op   0 allocs/op
BenchmarkHash/GoStdlib/8K-8        224902      5335 ns/op   1535.60 MB/s   0 B/op   0 allocs/op
BenchmarkHash/GoStdlib/1M-8          1789    670623 ns/op   1563.58 MB/s   0 B/op   0 allocs/op
BenchmarkHash/GoStdlib/5M-8           356   3352907 ns/op   1563.68 MB/s   0 B/op   0 allocs/op
BenchmarkHash/GoStdlib/10M-8          178   6703876 ns/op   1564.13 MB/s   0 B/op   0 allocs/op
PASS
ok  github.com/minio/sha256-simd  31.607s
```
amd64 mac

```
~/sha256-simd $ go test -run=^$ -bench=. -benchmem ./ -count=1
goos: darwin
goarch: amd64
pkg: github.com/minio/sha256-simd
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkHash/Generic/8Bytes-12    2982602     410.3 ns/op    19.50 MB/s   0 B/op   0 allocs/op
BenchmarkHash/Generic/64Bytes-12   1540022     782.3 ns/op    81.81 MB/s   0 B/op   0 allocs/op
BenchmarkHash/Generic/1K-12         193633      6219 ns/op   164.67 MB/s   0 B/op   0 allocs/op
BenchmarkHash/Generic/8K-12          20944     49602 ns/op   165.15 MB/s   0 B/op   0 allocs/op
BenchmarkHash/Generic/1M-12            202   6051028 ns/op   173.29 MB/s   0 B/op   0 allocs/op
BenchmarkHash/Generic/5M-12             37  32201704 ns/op   162.81 MB/s   0 B/op   0 allocs/op
BenchmarkHash/Generic/10M-12            16  63400945 ns/op   165.39 MB/s   0 B/op   0 allocs/op
BenchmarkHash/GoStdlib/8Bytes-12   6060865     188.0 ns/op    42.56 MB/s   0 B/op   0 allocs/op
BenchmarkHash/GoStdlib/64Bytes-12  3442257     342.0 ns/op   187.13 MB/s   0 B/op   0 allocs/op
BenchmarkHash/GoStdlib/1K-12        493141      2419 ns/op   423.34 MB/s   0 B/op   0 allocs/op
BenchmarkHash/GoStdlib/8K-12         66552     18119 ns/op   452.12 MB/s   0 B/op   0 allocs/op
BenchmarkHash/GoStdlib/1M-12           512   2310553 ns/op   453.82 MB/s   0 B/op   0 allocs/op
BenchmarkHash/GoStdlib/5M-12            99  11535992 ns/op   454.48 MB/s   0 B/op   0 allocs/op
BenchmarkHash/GoStdlib/10M-12           44  23383451 ns/op   448.43 MB/s   0 B/op   0 allocs/op
PASS
ok  github.com/minio/sha256-simd  20.488s
```
@yihuang I did some digging and it looks like the Go standard library has support for ARM SHA extensions and AVX2, which could explain why GoStdlib and ArmSha2 have such similar performance (Generic falls so far behind because it's an implementation that doesn't use hardware acceleration).
sha256-simd advertises improved performance for processors with Intel SHA Extensions or AVX512, which the standard library doesn't have optimizations for.
I didn't see any improvements for cosmos-sdk benchmarks with the simd library on my workstation, which has Intel SHA Extensions (5950x), but I plan to also benchmark on a machine with AVX512.
Actually, the iavl library uses sha256 heavily, so it should have a bigger impact there.
I ran benchmarks for cosmos-sdk and iavl on machines with AVX512 and Intel SHA Extensions with and without using the SIMD library, and got these results: https://gist.github.com/kirbyquerby/6635113b003abdaeaa93618d4e6970a2
There didn't seem to be significant improvements (in many benchmarks, there's even a slowdown) for using the SIMD library in either cosmos-sdk or iavl.
It would be interesting to test https://github.com/prysmaticlabs/gohashtree in iavl and see if there is any change.
I can reproduce the Intel benchmark result on my Mac laptop: it's faster by 6x if you do at least 16 hashing operations in a batch. But their API assumes the user always hashes 64 bytes into a 32-byte digest, so it can hard-code the padding block and do multiple hashes in parallel. For the iavl tree:
- we don't have a fixed block size to hard-code
- to exploit opportunities for parallel hashing, we would need to change the way we traverse the tree, for example hashing all the leaf nodes first in a batch, then all the height=1 nodes, etc.
Shall we close this issue and open a new one in IAVL if we want to dig further into gohashtree usage there?
I'll transfer this issue there.
> but their API assumes the user always hashes 64 bytes into a 32-byte digest

We can either modify our code or use a variation of their code.