tor-v3-vanity Why is the performance so terrible?

The tor-v3 vanity generator at cathugger/mkp224o has no GPU support, yet it is faster on my 10-year-old CPU than the numbers shown in the README here.

I get about 1 vanity hash every 10 seconds for a 5-character prefix on my CPU.
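For reference, the key rate implied by that figure can be estimated. This is a rough sketch that assumes a v3 address is base32 (32 possible symbols per character), so a 5-character prefix needs 32^5 attempts on average. That agrees with the 33554432 expected keys that tor-v3-vanity itself prints for a 5-character prefix in the output quoted later in this thread. The function names are made up for illustration.

```python
# Rough estimate only: assumes each key matches an n-character base32
# prefix with probability 1 / 32**n, independently of other keys.

def expected_attempts(prefix_len: int) -> int:
    """Average number of keys to try before one matches the prefix."""
    return 32 ** prefix_len

def implied_rate(prefix_len: int, seconds_per_hit: float) -> float:
    """Keys per second implied by seeing one hit every seconds_per_hit."""
    return expected_attempts(prefix_len) / seconds_per_hit

print(expected_attempts(5))   # 33554432
print(implied_rate(5, 10.0))  # 3355443.2
```

Under these assumptions, one hit every 10 seconds works out to roughly 3.4 million keys per second.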

I did not test whether the numbers are true, because I do not have an Nvidia GPU, but this should be A LOT faster on a GPU.

Has anyone tested both implementations side by side? Is tor-v3-vanity really that slow? That does not sound right.

With scallion it was easily possible to get 8- and 9-character prefixes.

May 23 '21 18:05 FreeApophis

Testing mkp224o on an i7-8565U, even with the best optimization for me (--enable-binsearch --enable-amd64-64-24k --enable-intfilter=64), I'm only getting ~15MK/sec, while running tor-v3-vanity on a GTX-1660 I'm getting ~5GK/sec.

However, I noticed that only 1 core is busy. Maybe if the validations were forwarded to and checked by multiple threads, I would get more performance and take better advantage of the keys generated by the GPU.

Jun 08 '21 18:06 marcialvieira

@FreeApophis is correct: the output is confusing. It wasn't 5GK/sec; the output was showing me a cumulative count, so the correct figure is 297KK/sec.
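This mix-up is easy to make when a tool prints a running total rather than a rate. A minimal sketch of converting cumulative totals into keys per second follows; the sample values are the "Tried ... keys" totals from the GTX 1080 output quoted later in this thread, and the function name is hypothetical.

```python
def rates_from_cumulative(samples):
    """samples: list of (elapsed_seconds, total_keys_so_far), in time order.
    Returns keys per second for each interval between consecutive samples."""
    rates = []
    for (t0, n0), (t1, n1) in zip(samples, samples[1:]):
        rates.append((n1 - n0) / (t1 - t0))
    return rates

# Cumulative totals reported at 30, 60 and 90 seconds.
samples = [(30, 2_012_160), (60, 4_024_320), (90, 6_036_480)]
print(rates_from_cumulative(samples))  # [67072.0, 67072.0]
```

Reading the last total as if it were a per-second figure would overstate the rate more and more the longer the run lasts.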

BTW: I'm getting 368KK/s with mkp224o on a Raspberry Pi 2. :O

Jun 09 '21 13:06 marcialvieira

Thanks for the numbers. So there is definitely something wrong with the implementation when a Raspberry Pi is faster than a GTX-1660.

Jun 09 '21 15:06 FreeApophis

4x2080Ti

4x3090

Aug 12 '21 19:08 23cku0r

As you can see from what @23cku0r posted, his benchmark is 8x my Raspberry Pi 2 performance, so just two CPU-based Raspberry Pis have the equivalent performance of one GPU-based 2080Ti. lol

Aug 13 '21 16:08 marcialvieira

Languages Rust 100.0%

Sep 27 '21 15:09 megapro17

I took a look at the code again, and I don't see an obvious reason why it should be so much slower. This was a weekend pet project I threw together a while back just to try out the nvptx target for rust. I have too much going on right now to look into this, but if anyone takes the time to instrument the code and determine where the bottleneck is, I'm happy to address the problem.

Nov 04 '21 05:11 dr-bonez

My best guess is that there's an issue with automatic block size detection. 256 threads with 272 blocks seems low for a 2080ti.
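To put those launch numbers in context, here is a rough sketch. The SM counts are not from this thread: 68 SMs for a 2080 Ti and 20 SMs for a GTX 1080 are the published figures as far as I know, so treat them as assumptions. The second call uses the 256 threads and 60 blocks reported for a GTX 1080 later in this thread. The function name is hypothetical.

```python
def launch_summary(threads_per_block, blocks, sm_count):
    """Return (total threads per launch, blocks per SM) for a launch config.
    sm_count is an assumed hardware figure, not something the tool printed."""
    total_threads = threads_per_block * blocks
    blocks_per_sm = blocks / sm_count
    return total_threads, blocks_per_sm

print(launch_summary(256, 272, 68))  # (69632, 4.0)
print(launch_summary(256, 60, 20))   # (15360, 3.0)
```

If those SM counts are right, the detected grid is a small whole multiple of the SM count on both cards. Whether that is enough to keep the GPU busy depends on how much work each thread does per launch, which this sketch does not measure.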

Nov 04 '21 05:11 dr-bonez

Something is definitely wrong here. This is my experience running it on my GTX 1080:

==27116== NVPROF is profiling process 27116, command: ./t3v -d keys hello
Launching kernel on device #0 with 256 threads and 60 blocks
Tried 2012160 / 33554432 (expected) keys.
Running for 30 seconds / 8 minutes, 21 seconds (expected).
Tried 4024320 / 33554432 (expected) keys.
Running for 1 minutes, 0 seconds / 8 minutes, 21 seconds (expected).
Tried 6036480 / 33554432 (expected) keys.
Running for 1 minutes, 30 seconds / 8 minutes, 21 seconds (expected).
^C==27116== Profiling application: ./t3v -d keys hello
==27116== Warning: 1 records have invalid timestamps due to insufficient device buffer space. You can configure the buffer space using the option --device-buffer-size.
==27116== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  90.9431s       397  229.08ms  213.42ms  253.10ms  render
                    0.00%  591.57us       794     745ns     351ns  3.0080us  [CUDA memcpy DtoH]
                    0.00%  226.64us       404     561ns     480ns  1.2480us  [CUDA memcpy HtoD]
      API calls:   99.87%  90.9558s       397  229.11ms  213.43ms  253.10ms  cuStreamSynchronize
                    0.11%  100.82ms         1  100.82ms  100.82ms  100.82ms  cuCtxCreate
                    0.01%  10.058ms       794  12.667us  5.8100us  93.444us  cuMemcpyDtoH
                    0.01%  5.0190ms       398  12.610us  9.2540us  72.085us  cuLaunchKernel
                    0.00%  1.7040ms       404  4.2170us  2.4680us  155.27us  cuMemcpyHtoD
                    0.00%  1.6310ms         1  1.6310ms  1.6310ms  1.6310ms  cuModuleLoadData
                    0.00%  255.92us       399     641ns     280ns  1.9780us  cuModuleGetFunction
                    0.00%  109.00us         6  18.166us  1.7130us  99.015us  cuMemAlloc
                    0.00%  9.9490us         1  9.9490us  9.9490us  9.9490us  cuStreamCreateWithPriority
                    0.00%  4.9050us         1  4.9050us  4.9050us  4.9050us  cuDeviceGetPCIBusId
                    0.00%  1.7310us         6     288ns     139ns     553ns  cuDeviceGetAttribute
                    0.00%     832ns         3     277ns     107ns     554ns  cuDeviceGetCount
                    0.00%     555ns         2     277ns     101ns     454ns  cuFuncGetAttribute
                    0.00%     500ns         2     250ns      98ns     402ns  cuDeviceGet
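
A quick sanity check on these numbers, as a sketch. It assumes that each GPU thread tests exactly one key per kernel launch; that is inferred from the arithmetic below, not confirmed from the source code.

```python
# Values copied from the output above.
threads_per_block = 256
blocks = 60
keys_per_report = 2_012_160   # keys tried in each 30 second interval
report_interval_s = 30

# Assumption: one key per thread per launch.
keys_per_launch = threads_per_block * blocks
launches_per_report = keys_per_report / keys_per_launch
ms_per_launch = 1000 * report_interval_s / launches_per_report

print(keys_per_launch)       # 15360
print(launches_per_report)   # 131.0
print(round(ms_per_launch))  # 229
```

Under that assumption the progress counter and the profile agree: 131 launches per 30 seconds is about 229 ms per launch, which matches the 229.08ms nvprof average for the render kernel. So nearly all of the time appears to be spent inside the kernel, and each launch only covers 15360 keys.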

Nov 18 '21 16:11 ghost

Same issue here. I figured I was just having bad luck, but no: running on an 8-GPU server produces fewer results than multi-processor mkp224o. I was really looking forward to this too, as it's the ONLY solution currently in existence for v3 onions.

Dec 29 '21 13:12 scramblr
