brainflayer

SSE Optimization

Open g4m613r opened this issue 3 years ago • 5 comments

Please add these SSE instructions to speed things up: https://github.com/JeanLucPons/VanitySearch/tree/master/hash

g4m613r avatar Jan 16 '22 03:01 g4m613r

Getting an appreciable speed benefit here would require substantial refactoring to include hashing in brainflayer's batching, and I have quite a lot of higher priority work to do, including integrating my own hashing library.

In contrast with vanity search programs, brainflayer spends quite a lot of time doing things other than hashing, so even with parallel hash computation the benefit is questionable.

ryancdotorg avatar Jan 17 '22 22:01 ryancdotorg

I did a few comparisons which I'd like to share.

I used the leadiro/sha256 command to measure the speed of sha256 hashing alone on my machine (one hash per line); it's around 347k/s:

$ seq 1 10000000 | sha256 /dev/stdin | pv -l | wc -l
10.0M 0:00:28 [ 347k/s] [                                                           <=>                                            ]
10000000

Now I've compared the total speed of brainflayer for the full run (63k/s):

$ ./hex2blf example.hex example.blf
$ seq 1 10000000 | ./brainflayer -b example.blf -v
 rate:  63776.06 p/s found:     0/10000000   elapsed:  156.799 s

and the speed of brainflayer with the sha256 command's output as input (52k/s):

$ seq 1 10000000 | sha256 /dev/stdin | ./brainflayer -b example.blf -t priv -x -v
 rate:  52443.48 p/s found:     0/10000000   elapsed:  190.681 s

(Above: brainflayer's own hashing seems to be about 10k/s quicker than piping in the other command's output; however, the same speed as in the previous run is reported when a statically generated file is used as input.)

and the speed of brainflayer with a fake sha256 file filled with zeros as input (102k/s):

$ printf '0%.0s' {1..64000000} | fold -w64 > sha256-zeros.txt
$ ./brainflayer -b example.blf -i sha256-zeros.txt -t priv -x -v
 rate: 102481.49 p/s found:     0/1000000    elapsed:    9.758 s

So even if sha256 hashing is removed from the equation, the speed only increases by ~60% (from 63.8k/s to 102.5k/s). So optimizing it probably won't make a big difference. Not sure about ripemd160.
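For reference, the relative speedup and the per-key cost can be worked out from the two rates brainflayer printed above. A minimal sketch using awk (the variable names are only for illustration):

```shell
# Rates (keys per second) copied from the two brainflayer runs above.
full_rate=63776.06      # passphrases as input, sha256 included
nohash_rate=102481.49   # pre-hashed input (-t priv -x), sha256 skipped

# Relative speedup in percent when sha256 is skipped.
speedup=$(awk -v a="$full_rate" -v b="$nohash_rate" \
  'BEGIN { printf "%.1f", (b / a - 1) * 100 }')

# Microseconds spent per key in each run.
full_us=$(awk -v a="$full_rate" 'BEGIN { printf "%.2f", 1e6 / a }')
nohash_us=$(awk -v b="$nohash_rate" 'BEGIN { printf "%.2f", 1e6 / b }')

echo "speedup: ${speedup}%  per key: ${full_us} us vs ${nohash_us} us"
```

This works out to a speedup of about 60.7%, or roughly 15.68 us per key for the full run against 9.76 us with pre-hashed input. The difference, about 5.9 us per key, is what skipping sha256 saved in this particular test.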

kenorb avatar Jun 14 '22 22:06 kenorb

Due to the way the bloom filter works, you really need to use random data. Also, all zero bits isn't a valid private key; I'd recommend fixing that if you want a valid test.
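One way to build such an input, sketched under the assumption of GNU coreutils (for od -w32) and a /dev/urandom device; the helper name gen_random_privkeys and the output file name are made up for this example. Each line is 32 random bytes written as 64 hex characters, so a value that is zero or not below the secp256k1 group order is possible in principle, but the chance is on the order of 2^-128 per line:

```shell
# Hypothetical helper: print N random 256-bit values, one 64-char hex string per line.
# Assumes GNU od (-w32 sets 32 bytes per output line) and /dev/urandom.
gen_random_privkeys() {
  count=$1
  head -c "$((count * 32))" /dev/urandom |
    od -An -v -tx1 -w32 |   # no offsets, no duplicate suppression, hex bytes
    tr -d ' '               # drop the spaces between bytes
}

# Same line count as the all-zeros file used in the earlier test.
gen_random_privkeys 1000000 > sha256-random.txt
```

The resulting file can then stand in for sha256-zeros.txt in the earlier run: ./brainflayer -b example.blf -i sha256-random.txt -t priv -x -v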

ryancdotorg avatar Jun 14 '22 23:06 ryancdotorg

OK, I did a more accurate test with profiling (compiled with -pg), which shows the slowest parts that could be optimized.

$ seq 1 10000000 | ./brainflayer -b example.blf
$ gprof -bQ ./brainflayer gmon.out | head -n20
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 27.25     57.45    57.45 1279176572     0.00     0.00  secp256k1_fe_mul_inner
 22.44    104.76    47.31 10002432     0.00     0.00  secp256k1_ecmult_gen2
 14.70    135.75    30.99 20000000     0.00     0.00  ripemd160_rawcompress
  8.59    153.86    18.11 160038913     0.00     0.00  secp256k1_gej_add_ge_var
  6.11    166.74    12.88 20007307     0.00     0.00  secp256k1_fe_get_b32
  5.89    179.16    12.42 465356038     0.00     0.00  secp256k1_fe_sqr_inner
  3.82    187.22     8.06 755426028     0.00     0.00  secp256k1_fe_negate
  2.76    193.04     5.82 906513513     0.00     0.00  secp256k1_fe_add
  1.91    197.06     4.02 1279176572     0.00     0.00  secp256k1_fe_mul
  1.89    201.05     3.99                             main
  1.88    205.02     3.97 300072963     0.00     0.00  secp256k1_fe_normalize_weak
  0.61    206.31     1.29 465356038     0.00     0.00  secp256k1_fe_sqr
  0.46    207.29     0.98 10002434     0.00     0.00  secp256k1_gej_set_ge
  0.41    208.16     0.87 151086674     0.00     0.00  secp256k1_fe_mul_int
  0.39    208.98     0.82 151087486     0.00     0.00  secp256k1_fe_normalizes_to_zero_var

secp256k1 is the slowest part (the top two functions alone take ~50%).
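To see how the rows above add up per library, the "% time" column can be summed by function-name prefix. A small sketch with awk; sum_pct is a made-up helper, and the rows are pasted from the profile above rather than read from gmon.out:

```shell
# Hypothetical helper: sum the "% time" column (field 1) of gprof flat-profile
# rows whose function name (last field) starts with the given prefix.
sum_pct() {
  awk -v p="$1" 'index($NF, p) == 1 { s += $1 } END { printf "%.2f\n", s }'
}

# The 15 rows shown in the flat profile above.
profile=$(cat <<'EOF'
 27.25     57.45    57.45 1279176572     0.00     0.00  secp256k1_fe_mul_inner
 22.44    104.76    47.31 10002432     0.00     0.00  secp256k1_ecmult_gen2
 14.70    135.75    30.99 20000000     0.00     0.00  ripemd160_rawcompress
  8.59    153.86    18.11 160038913     0.00     0.00  secp256k1_gej_add_ge_var
  6.11    166.74    12.88 20007307     0.00     0.00  secp256k1_fe_get_b32
  5.89    179.16    12.42 465356038     0.00     0.00  secp256k1_fe_sqr_inner
  3.82    187.22     8.06 755426028     0.00     0.00  secp256k1_fe_negate
  2.76    193.04     5.82 906513513     0.00     0.00  secp256k1_fe_add
  1.91    197.06     4.02 1279176572     0.00     0.00  secp256k1_fe_mul
  1.89    201.05     3.99                             main
  1.88    205.02     3.97 300072963     0.00     0.00  secp256k1_fe_normalize_weak
  0.61    206.31     1.29 465356038     0.00     0.00  secp256k1_fe_sqr
  0.46    207.29     0.98 10002434     0.00     0.00  secp256k1_gej_set_ge
  0.41    208.16     0.87 151086674     0.00     0.00  secp256k1_fe_mul_int
  0.39    208.98     0.82 151087486     0.00     0.00  secp256k1_fe_normalizes_to_zero_var
EOF
)

secp_pct=$(printf '%s\n' "$profile" | sum_pct secp256k1_)
rmd_pct=$(printf '%s\n' "$profile" | sum_pct ripemd160_)
echo "secp256k1: ${secp_pct}%  ripemd160: ${rmd_pct}%"
```

On the 15 rows listed, this gives 82.52% for the secp256k1_ functions and 14.70% for ripemd160_rawcompress.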

The link referenced in the first post has no implementation of secp256k1, only of ripemd160. Also, there is no point in optimizing sha256, because it accounts for a tiny portion of the run (pass2priv takes only 0.04% of the total).

kenorb avatar Jun 14 '22 23:06 kenorb

I'm surprised to see ripemd160_rawcompress so high there, tbh.

ryancdotorg avatar Jun 15 '22 12:06 ryancdotorg