all: use SHA256 with SIMD instructions for higher performance and throughput
In this repository, we heavily use the Go standard library's crypto/sha256. However, there exists a Single Instruction Multiple Data (SIMD) package from our friends at MinIO, https://github.com/minio/sha256-simd, which promises 8x speedups when using AVX instructions. We should explore this.
Let's check whether performance radically improves and, if so, plumb it in.
Kindly cc-ing my colleague @elias-orijtech
For Admin Use
- [ ] Not duplicate issue
- [ ] Appropriate labels applied
- [ ] Appropriate contributors tagged
- [ ] Contributor assigned/self-assigned
Is it okay to assign this to you and your team, @odeke-em?
Yes, please @marbar3778! We are working on it. I just need to find a machine with AVX512 so that we can produce benchmarks.
In support of using that library! Though I think it's probably advisable to turn off AVX-512 via a build flag, given the SDK workload (https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-use-these-new-instructions/).
+1
Also of interest for this issue are a number of occurrences of crypto.Sha256():
(screenshot: search results showing occurrences of crypto.Sha256)

These come from what appears to be a helper function that wraps crypto/sha256:
https://github.com/cometbft/cometbft/blob/e9b91405b643b46b011865c4b7e1c1af0aa5c521/crypto/hash.go#L7-L11
We'd probably want to either replace these usages or update cometbft to use the SIMD library as well.
Thanks for the insight. I would advocate for replacing the wrapped function, as we are trying to rely less on comet.
The last time I checked, I didn't see much improvement on the dev machines I have at hand (an x86_64 Mac laptop and an arm64 Linux box); on the Mac the stdlib is actually much faster. I just reran the benchmark with go1.20, with the results below:
arm64 linux

```
~/sha256-simd $ go test -run=^$ -bench=. -benchmem ./ -count=1
goos: linux
goarch: arm64
pkg: github.com/minio/sha256-simd
BenchmarkHash/Generic/8Bytes-8    2184978     549.6 ns/op     14.56 MB/s   0 B/op   0 allocs/op
BenchmarkHash/Generic/64Bytes-8   1000000      1064 ns/op     60.17 MB/s   0 B/op   0 allocs/op
BenchmarkHash/Generic/1K-8         139132      8623 ns/op    118.76 MB/s   0 B/op   0 allocs/op
BenchmarkHash/Generic/8K-8          18447     65101 ns/op    125.83 MB/s   0 B/op   0 allocs/op
BenchmarkHash/Generic/1M-8            144   8288227 ns/op    126.51 MB/s   0 B/op   0 allocs/op
BenchmarkHash/Generic/5M-8             28  41402281 ns/op    126.63 MB/s   3 B/op   0 allocs/op
BenchmarkHash/Generic/10M-8            14  82817517 ns/op    126.61 MB/s   0 B/op   0 allocs/op
BenchmarkHash/ArmSha2/8Bytes-8   11930301     100.6 ns/op     79.55 MB/s   0 B/op   0 allocs/op
BenchmarkHash/ArmSha2/64Bytes-8   7533750     160.1 ns/op    399.67 MB/s   0 B/op   0 allocs/op
BenchmarkHash/ArmSha2/1K-8        1547152     775.6 ns/op   1320.21 MB/s   0 B/op   0 allocs/op
BenchmarkHash/ArmSha2/8K-8         224019      5354 ns/op   1530.03 MB/s   0 B/op   0 allocs/op
BenchmarkHash/ArmSha2/1M-8           1789    670705 ns/op   1563.39 MB/s   0 B/op   0 allocs/op
BenchmarkHash/ArmSha2/5M-8            356   3352908 ns/op   1563.68 MB/s   0 B/op   0 allocs/op
BenchmarkHash/ArmSha2/10M-8           178   6706550 ns/op   1563.51 MB/s   0 B/op   0 allocs/op
BenchmarkHash/GoStdlib/8Bytes-8  11268408     106.6 ns/op     75.04 MB/s   0 B/op   0 allocs/op
BenchmarkHash/GoStdlib/64Bytes-8  8466012     141.9 ns/op    450.98 MB/s   0 B/op   0 allocs/op
BenchmarkHash/GoStdlib/1K-8       1586331     756.2 ns/op   1354.14 MB/s   0 B/op   0 allocs/op
BenchmarkHash/GoStdlib/8K-8        224902      5335 ns/op   1535.60 MB/s   0 B/op   0 allocs/op
BenchmarkHash/GoStdlib/1M-8          1789    670623 ns/op   1563.58 MB/s   0 B/op   0 allocs/op
BenchmarkHash/GoStdlib/5M-8           356   3352907 ns/op   1563.68 MB/s   0 B/op   0 allocs/op
BenchmarkHash/GoStdlib/10M-8          178   6703876 ns/op   1564.13 MB/s   0 B/op   0 allocs/op
PASS
ok  github.com/minio/sha256-simd  31.607s
```
amd64 mac

```
~/sha256-simd $ go test -run=^$ -bench=. -benchmem ./ -count=1
goos: darwin
goarch: amd64
pkg: github.com/minio/sha256-simd
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkHash/Generic/8Bytes-12    2982602     410.3 ns/op    19.50 MB/s   0 B/op   0 allocs/op
BenchmarkHash/Generic/64Bytes-12   1540022     782.3 ns/op    81.81 MB/s   0 B/op   0 allocs/op
BenchmarkHash/Generic/1K-12         193633      6219 ns/op   164.67 MB/s   0 B/op   0 allocs/op
BenchmarkHash/Generic/8K-12          20944     49602 ns/op   165.15 MB/s   0 B/op   0 allocs/op
BenchmarkHash/Generic/1M-12            202   6051028 ns/op   173.29 MB/s   0 B/op   0 allocs/op
BenchmarkHash/Generic/5M-12             37  32201704 ns/op   162.81 MB/s   0 B/op   0 allocs/op
BenchmarkHash/Generic/10M-12            16  63400945 ns/op   165.39 MB/s   0 B/op   0 allocs/op
BenchmarkHash/GoStdlib/8Bytes-12   6060865     188.0 ns/op    42.56 MB/s   0 B/op   0 allocs/op
BenchmarkHash/GoStdlib/64Bytes-12  3442257     342.0 ns/op   187.13 MB/s   0 B/op   0 allocs/op
BenchmarkHash/GoStdlib/1K-12        493141      2419 ns/op   423.34 MB/s   0 B/op   0 allocs/op
BenchmarkHash/GoStdlib/8K-12         66552     18119 ns/op   452.12 MB/s   0 B/op   0 allocs/op
BenchmarkHash/GoStdlib/1M-12           512   2310553 ns/op   453.82 MB/s   0 B/op   0 allocs/op
BenchmarkHash/GoStdlib/5M-12            99  11535992 ns/op   454.48 MB/s   0 B/op   0 allocs/op
BenchmarkHash/GoStdlib/10M-12           44  23383451 ns/op   448.43 MB/s   0 B/op   0 allocs/op
PASS
ok  github.com/minio/sha256-simd  20.488s
```
@yihuang I did some digging and it looks like the Go standard library has support for ARM SHA extensions and AVX2, which could explain why GoStdlib and ArmSha2 have such similar performance (Generic falls so far behind because it's an implementation that doesn't use hardware acceleration).
sha256-simd advertises improved performance for processors with Intel SHA Extensions or AVX512, which the standard library doesn't have optimizations for.
I didn't see any improvements for cosmos-sdk benchmarks with the simd library on my workstation, which has Intel SHA Extensions (5950x), but I plan to also benchmark on a machine with AVX512.
Actually, the iavl library uses sha256 heavily, so it should have a bigger impact there.
I ran benchmarks for cosmos-sdk and iavl on machines with AVX512 and Intel SHA Extensions with and without using the SIMD library, and got these results: https://gist.github.com/kirbyquerby/6635113b003abdaeaa93618d4e6970a2
There didn't seem to be significant improvements (in many benchmarks, there's even a slowdown) for using the SIMD library in either cosmos-sdk or iavl.
It would be interesting to test https://github.com/prysmaticlabs/gohashtree in iavl and see if there is any change.
I can reproduce the Intel benchmark result on my Mac laptop: it's faster by 6x if you do at least 16 hashing operations in a batch. But their API assumes the user always hashes 64 bytes into a 32-byte digest, so it can hard-code the padding block and do multiple hashes in parallel. For the iavl tree:
- we don't have a fixed block size to hard-code
- to exploit opportunities for parallel hashing, we would need to change the way we traverse the tree, for example hashing all the leaf nodes first in a batch, then all the height=1 nodes, etc.
Shall we close this issue and open a new one in IAVL if we want to dig further into gohashtree usage there?
I'll transfer this issue there.
> but their API assumes the user always hashes 64 bytes into a 32-byte digest

We can either modify our code or use a variation of their code.