brainflayer
SSE Optimization
Please add the SSE-optimized hashing from here to speed things up: https://github.com/JeanLucPons/VanitySearch/tree/master/hash
Getting an appreciable speed benefit here would require substantial refactoring to include hashing in brainflayer's batching, and I have quite a lot of higher priority work to do, including integrating my own hashing library.
In contrast with vanity search programs, brainflayer spends quite a lot of time doing things other than hashing, so even with parallel hash computation the benefit is questionable.
I did a few comparisons which I'd like to share.
I've used the leadiro/sha256 command to measure the speed of sha256 hashing alone on my machine (one hash per input line); it's around 347k/s:
$ seq 1 10000000 | sha256 /dev/stdin | pv -l | wc -l
10.0M 0:00:28 [ 347k/s] [ <=> ]
10000000
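As a sanity check on the method, it's worth confirming that generating the input is far faster than hashing it, so seq itself isn't a bottleneck in any of the pipelines below (a quick sketch using the same pv tool as above):

```shell
# Measure how fast seq alone can produce lines; if this rate is far above
# 347k/s, the input generator is not what limits the hashing benchmark.
seq 1 10000000 | pv -l > /dev/null
```

On a typical machine seq emits lines orders of magnitude faster than the hashing rates measured here.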
Now I've compared the total speed of brainflayer for the full run (63k/s):
$ ./hex2blf example.hex example.blf
$ seq 1 10000000 | ./brainflayer -b example.blf -v
rate: 63776.06 p/s found: 0/10000000 elapsed: 156.799 s
and the speed of brainflayer with the sha256 command's output as input (52k/s):
$ seq 1 10000000 | sha256 /dev/stdin | ./brainflayer -b example.blf -t priv -x -v
rate: 52443.48 p/s found: 0/10000000 elapsed: 190.681 s
(From the above, brainflayer's built-in hashing seems ~10k/s faster than piping through the external sha256 command; however, when the same pre-hashed data is read from a statically generated file instead of a pipe, the rate goes back to roughly the previous 63k/s.)
and the speed of brainflayer with a fake sha256 file filled with zeros as input (102k/s; note this run processes 1M lines rather than 10M):
$ printf '0%.0s' {1..64000000} | fold -w64 > sha256-zeros.txt
$ ./brainflayer -b example.blf -i sha256-zeros.txt -t priv -x -v
rate: 102481.49 p/s found: 0/1000000 elapsed: 9.758 s
So even if sha256 hashing is removed from the equation, throughput only rises by about 60% (102.5k/s vs. 63.8k/s), so optimizing it alone probably won't make a big difference. Not sure about ripemd160.
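The speedup can be computed directly from the two reported rates (a quick check with awk):

```shell
# Ratio of the zeros-file rate to the full-pipeline rate reported above;
# prints the multiplicative speedup from skipping sha256.
awk 'BEGIN { printf "%.2f\n", 102481.49 / 63776.06 }'
```
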
Due to the way the bloom filter works, you really need to use random data: identical lines probe the same filter bits every time, so the lookups stay cache-hot and look unrealistically fast. Also, all zero bits isn't a valid private key. I'd recommend fixing that if you want a valid test.
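A fairer input file of random 64-hex-char lines could be generated like this (a sketch; assumes xxd is available, and note that a random 32-byte value is a valid secp256k1 private key except with negligible probability):

```shell
# Generate 1,000,000 random candidate private keys, one 64-hex-char line
# each (32 random bytes per line, hex-encoded by xxd).
head -c 32000000 /dev/urandom | xxd -p -c32 > sha256-random.txt
```

and then rerun the same test against it, e.g. ./brainflayer -b example.blf -i sha256-random.txt -t priv -x -v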
OK, I did a more accurate test with profiling (compiled with -pg), which shows the slowest parts that could be optimized.
$ seq 1 10000000 | ./brainflayer -b example.blf
$ gprof -bQ ./brainflayer gmon.out | head -n20
Flat profile:
Each sample counts as 0.01 seconds.
  %   cumulative   self               self     total
 time    seconds   seconds     calls  s/call   s/call  name
27.25      57.45     57.45 1279176572   0.00     0.00  secp256k1_fe_mul_inner
22.44     104.76     47.31   10002432   0.00     0.00  secp256k1_ecmult_gen2
14.70     135.75     30.99   20000000   0.00     0.00  ripemd160_rawcompress
 8.59     153.86     18.11  160038913   0.00     0.00  secp256k1_gej_add_ge_var
 6.11     166.74     12.88   20007307   0.00     0.00  secp256k1_fe_get_b32
 5.89     179.16     12.42  465356038   0.00     0.00  secp256k1_fe_sqr_inner
 3.82     187.22      8.06  755426028   0.00     0.00  secp256k1_fe_negate
 2.76     193.04      5.82  906513513   0.00     0.00  secp256k1_fe_add
 1.91     197.06      4.02 1279176572   0.00     0.00  secp256k1_fe_mul
 1.89     201.05      3.99                             main
 1.88     205.02      3.97  300072963   0.00     0.00  secp256k1_fe_normalize_weak
 0.61     206.31      1.29  465356038   0.00     0.00  secp256k1_fe_sqr
 0.46     207.29      0.98   10002434   0.00     0.00  secp256k1_gej_set_ge
 0.41     208.16      0.87  151086674   0.00     0.00  secp256k1_fe_mul_int
 0.39     208.98      0.82  151087486   0.00     0.00  secp256k1_fe_normalizes_to_zero_var
secp256k1 is the slowest part: its two hottest functions alone account for ~50% of the time, and all the secp256k1_* entries together add up to over 80%.
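The total share attributed to secp256k1 can be summed straight from the flat profile; a hedged one-liner, assuming gprof's default layout where field 1 is "% time" and the last field is the symbol name:

```shell
# Sum the "% time" column over every secp256k1_* symbol in the flat profile.
gprof -bQ ./brainflayer gmon.out \
  | awk '$NF ~ /^secp256k1_/ { total += $1 } END { printf "%.2f%%\n", total }'
```
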
In the link referenced in the first post there is no implementation of secp256k1, just one for ripemd160. There is also no point in optimizing sha256, because it accounts for a tiny portion of the time (pass2priv takes only 0.04% of the total run).
I'm surprised to see ripemd160_rawcompress so high there, tbh.