tor-v3-vanity
Why is the performance so terrible?
The tor-v3 vanity generator cathugger/mkp224o has no GPU support, yet it is faster on my 10-year-old CPU than the numbers shown in the README here.
I get about 1 vanity hash every 10 seconds for a 5-character prefix on my CPU.
I could not check whether the README numbers are accurate, because I do not have an Nvidia GPU, but a GPU should be A LOT faster than this.
Has anyone tested both implementations side by side? Is tor-v3-vanity really that slow? That does not sound right.
With scallion it was easily possible to have 8 and 9 character prefixes.
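A rough sketch of the arithmetic behind these prefix lengths, assuming only that onion addresses are base32, so each extra prefix character multiplies the expected work by 32. The function names are made up for illustration and come from neither tool. For 5 characters this gives 33554432, the same "expected" figure that a t3v log later in this thread prints for the prefix `hello`.

```rust
/// Expected number of keys to try before a base32 prefix of `prefix_len`
/// characters matches: each character matches with probability 1/32.
fn expected_attempts(prefix_len: u32) -> u64 {
    32u64.pow(prefix_len)
}

/// Key rate implied by finding one match every `secs_per_hit` seconds on average.
fn implied_keys_per_sec(prefix_len: u32, secs_per_hit: u64) -> u64 {
    expected_attempts(prefix_len) / secs_per_hit
}

fn main() {
    for len in [5u32, 8, 9] {
        println!("{} chars: {} expected attempts", len, expected_attempts(len));
    }
    // One 5-character hit every 10 seconds, as reported above for the CPU.
    println!("implied rate: {} keys/s", implied_keys_per_sec(5, 10));
}
```

By this estimate, one 5-character hit every 10 seconds corresponds to roughly 3.4 million keys per second, and each step from 5 to 8 to 9 characters costs 32768 and then another 32 times more work.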
Testing mkp224o on an i7-8565U, even with the optimizations that work best for me (--enable-binsearch --enable-amd64-64-24k --enable-intfilter=64), I'm only getting ~15MK/sec, while running tor-v3-vanity on a GTX-1660 I'm getting ~5GK/sec.
However, I noticed that only 1 core is busy. Maybe if the keys generated by the GPU were handed off and validated across multiple threads, I would get more performance out of them.
@FreeApophis is correct, the output information is confusing. It wasn't 5GK/sec: the output was showing me a cumulative count, so the correct rate is 297KK/sec.
BTW: I'm getting 368KK/s with the mkp224o on a raspberry pi 2. :O
Thanks for the numbers, so there is definitely something wrong with the implementation when a Raspberry Pi is faster than a GTX-1660.
4x2080Ti
4x3090
As you can see from what @23cku0r posted, his benchmark is 8x my Raspberry Pi 2 performance, so just 2 CPU-based Raspberry Pis match the performance of a GPU-based 2080Ti. lol
I took a look at the code again, and I don't see an obvious reason why it should be so much slower. This was a weekend pet project I threw together a while back just to try out the nvptx target for rust. I have too much going on right now to look into this, but if anyone takes the time to instrument the code and determine where the bottleneck is, I'm happy to address the problem.
My best guess is that there's an issue with automatic block size detection. 256 threads with 272 blocks seems low for a 2080ti.
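To put that launch configuration in numbers, here is a small sketch. The count of 68 streaming multiprocessors (SMs) for a 2080 Ti is an assumption taken from Nvidia's published specs, not from this thread, and the helper names are hypothetical.

```rust
/// Total threads in one launch of `blocks` blocks with `threads_per_block` threads each.
fn total_threads(threads_per_block: u32, blocks: u32) -> u32 {
    threads_per_block * blocks
}

/// Blocks available per SM if the grid is spread evenly across the card.
fn blocks_per_sm(blocks: u32, sm_count: u32) -> u32 {
    blocks / sm_count
}

fn main() {
    let (threads_per_block, blocks) = (256, 272); // values quoted above
    let sm_count = 68; // assumption: SM count of a 2080 Ti per public specs
    let per_sm = blocks_per_sm(blocks, sm_count);
    println!("threads per launch: {}", total_threads(threads_per_block, blocks));
    println!("blocks per SM: {}", per_sm);
    println!("threads per SM: {}", per_sm * threads_per_block);
}
```

Under that assumption the grid works out to 4 blocks, or 1024 threads, per SM. Whether that is actually too small to keep the card busy is what profiling would have to show.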
Something is definitely wrong here. This is my experience running it on my GTX 1080:
==27116== NVPROF is profiling process 27116, command: ./t3v -d keys hello
Launching kernel on device #0 with 256 threads and 60 blocks
Tried 2012160 / 33554432 (expected) keys.
Running for 30 seconds / 8 minutes, 21 seconds (expected).
Tried 4024320 / 33554432 (expected) keys.
Running for 1 minutes, 0 seconds / 8 minutes, 21 seconds (expected).
Tried 6036480 / 33554432 (expected) keys.
Running for 1 minutes, 30 seconds / 8 minutes, 21 seconds (expected).
^C==27116== Profiling application: ./t3v -d keys hello
==27116== Warning: 1 records have invalid timestamps due to insufficient device buffer space. You can configure the buffer space using the option --device-buffer-size.
==27116== Profiling result:
           Type  Time(%)       Time  Calls       Avg       Min       Max  Name
GPU activities:  100.00%   90.9431s    397  229.08ms  213.42ms  253.10ms  render
                   0.00%   591.57us    794     745ns     351ns  3.0080us  [CUDA memcpy DtoH]
                   0.00%   226.64us    404     561ns     480ns  1.2480us  [CUDA memcpy HtoD]
     API calls:   99.87%   90.9558s    397  229.11ms  213.43ms  253.10ms  cuStreamSynchronize
                   0.11%   100.82ms      1  100.82ms  100.82ms  100.82ms  cuCtxCreate
                   0.01%   10.058ms    794  12.667us  5.8100us  93.444us  cuMemcpyDtoH
                   0.01%   5.0190ms    398  12.610us  9.2540us  72.085us  cuLaunchKernel
                   0.00%   1.7040ms    404  4.2170us  2.4680us  155.27us  cuMemcpyHtoD
                   0.00%   1.6310ms      1  1.6310ms  1.6310ms  1.6310ms  cuModuleLoadData
                   0.00%   255.92us    399     641ns     280ns  1.9780us  cuModuleGetFunction
                   0.00%   109.00us      6  18.166us  1.7130us  99.015us  cuMemAlloc
                   0.00%   9.9490us      1  9.9490us  9.9490us  9.9490us  cuStreamCreateWithPriority
                   0.00%   4.9050us      1  4.9050us  4.9050us  4.9050us  cuDeviceGetPCIBusId
                   0.00%   1.7310us      6     288ns     139ns     553ns  cuDeviceGetAttribute
                   0.00%      832ns      3     277ns     107ns     554ns  cuDeviceGetCount
                   0.00%      555ns      2     277ns     101ns     454ns  cuFuncGetAttribute
                   0.00%      500ns      2     250ns      98ns     402ns  cuDeviceGet
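The numbers in this log can be turned into a key rate with a little arithmetic. The sketch below does that using only values printed above. Because the "Tried" counter is cumulative, the rate comes from the difference between two readings. The last function additionally assumes that every GPU thread tests exactly one key per `render` launch; that is a guess used for comparison, not something confirmed from the code.

```rust
/// Keys per second from two cumulative "Tried N" readings taken `elapsed_secs` apart.
fn keys_per_second(tried_before: u64, tried_after: u64, elapsed_secs: u64) -> u64 {
    (tried_after - tried_before) / elapsed_secs
}

/// Total GPU threads in one kernel launch.
fn threads_per_launch(threads_per_block: u64, blocks: u64) -> u64 {
    threads_per_block * blocks
}

/// Rate if each thread tested exactly one key per launch (assumption, see above).
fn one_key_per_thread_rate(threads: u64, launch_micros: u64) -> u64 {
    threads * 1_000_000 / launch_micros
}

fn main() {
    // Readings at 30 s and 60 s from the log.
    let measured = keys_per_second(2_012_160, 4_024_320, 30);
    // "256 threads and 60 blocks" from the log.
    let threads = threads_per_launch(256, 60);
    // Average `render` time from nvprof: 229.08 ms.
    let modeled = one_key_per_thread_rate(threads, 229_080);
    println!("measured: {} keys/s", measured);
    println!("threads per launch: {}", threads);
    println!("modeled: {} keys/s", modeled);
}
```

The measured rate is about 67 thousand keys per second, and the one-key-per-thread model lands within a fraction of a percent of it. If that model holds, throughput is bounded by the roughly 229 ms each `render` call takes, which fits nvprof attributing essentially all GPU time to that kernel.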
Same issue here. I figured I was just having bad luck, but no: running on an 8-GPU server produces fewer results than multi-processor mkp224o. I was really looking forward to this too, as it's the ONLY solution currently in existence for v3 onions.