
NT-opencl performance regression on dual socket rig after the "avoid nvidia busy-wait" commit

Open magnumripper opened this issue 3 years ago • 18 comments

I had a report of significant regression, from ~25 Gp/s to more like 17 Gp/s per GPU, running incremental plus -mask:?w?a?a under MPI on a GPU rig with 8x 2080ti (and more CPU cores than GPU cards). CPU utilization went from ~86% to ~61%.

I was granted enough access to the rig to reproduce the issue, and to confirm that only reverting 7f6aed9de fixes the problem completely.

Not sure when/how much I can experiment more with that particular rig. My single-GPU tests show nothing of the sort, so I'm puzzled as to why it would happen with more processes and GPUs. Perhaps 3x fork would show it, so some tests on "super" (with the nvidias not simultaneously used for other things) are needed.

Meanwhile perhaps we should consider reverting that commit?

magnumripper avatar Feb 04 '22 13:02 magnumripper

@magnumripper Can you test on that rig whether the regression is in fact specific to this format, or would also be seen with another format we treated similarly? Reverting for just one format isn't such a good idea if the problem isn't format-specific.

solardiz avatar Feb 04 '22 13:02 solardiz

I believe it's only NT (and that format is kinda special with its high speed and short durations). I'll see if I can get more data.

magnumripper avatar Feb 04 '22 14:02 magnumripper

Maybe we need to actually increase the 1ms threshold? Also, the 2.4% and 1.2% figures become rather low in absolute terms when we're near that threshold. While we could make code changes to ensure these are no lower than some number of microseconds, simply increasing the sleep threshold e.g. from 1ms to 10ms would avoid using ridiculously low figures there as well (they would still be computed, just never used).

What's the kernel duration for nt-opencl on 2080Ti with that mask and I guess certain autotuned LWS/GWS?

solardiz avatar Feb 04 '22 16:02 solardiz

Maybe this is both a good approach and a trivial enough change:

+++ b/src/opencl_common.h
@@ -342,8 +342,8 @@ void opencl_process_event(void);
        if (gpu_nvidia(device_info[gpu_id])) { \
                wait_start = john_get_nano(); \
                uint64_t us = wait_min >> 10; /* 2.4% less than min */ \
-               if (wait_sleep && us >= 1000) \
-                       usleep(us); \
+               if (wait_sleep && us >= 2000) \
+                       usleep(us - 1000); \
        }
 #define WAIT_UPDATE \
        if (gpu_nvidia(device_info[gpu_id])) { \

This leaves a fixed 1ms allowance for usleep possibly sleeping longer than intended.
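
For context, a rough standalone sketch of the overall pattern these macros implement (illustrative only - the real code is structured as WAIT_* macros in opencl_common.h and tracks wait_min across calls; clWaitForEvents here stands in for wherever the driver ends up busy-waiting):

#include <stdint.h>
#include <unistd.h>
#include <CL/cl.h>

/* Sleep for slightly less than the shortest wait observed so far, then
 * let the driver wait (a busy-wait, on NVIDIA's OpenCL) for the rest.
 * wait_min is in nanoseconds; >> 10 gives microseconds minus ~2.4%. */
static void sleep_then_wait(cl_event ev, uint64_t wait_min)
{
    uint64_t us = wait_min >> 10;

    if (us >= 2000)
        usleep(us - 1000); /* keep a fixed 1 ms allowance for oversleep */

    clWaitForEvents(1, &ev);
}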

solardiz avatar Feb 04 '22 18:02 solardiz

We're talking about e.g. a 45 ms kernel duration, so it's actually not extremely short... I think the first thing I'll do is really ensure only NT is affected. Also, I need to test whether --fork=8 has the same problem as MPIx8 (I just assumed so).

During my work with #5006 I had some temporary code in the macros that reliably detected and warned about oversleep. I need to find or rewrite that and possibly commit it as an alternative (e.g. for -DDEBUG).
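
Something in this spirit (just a sketch from memory, not the actual code from #5006):

#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Monotonic timestamp in nanoseconds. */
static uint64_t nano_now(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

/* usleep() wrapper that warns when we oversleep by more than 1 ms
 * (the threshold is a placeholder). */
static void usleep_checked(useconds_t us)
{
    uint64_t start = nano_now();
    uint64_t slept_us;

    usleep(us);
    slept_us = (nano_now() - start) / 1000;
    if (slept_us > (uint64_t)us + 1000)
        fprintf(stderr, "Oversleep: requested %lu us, got %lu us\n",
                (unsigned long)us, (unsigned long)slept_us);
}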

magnumripper avatar Feb 07 '22 23:02 magnumripper

I think an oversleep wouldn't result in "from ~25 Gp/s to more like 17 Gp/s" anyway, because it'd reduce the sleep duration to 7/8 (and then again, and again if necessary) and keep that lower duration for almost another 20 kernel invocations. So if one 7/8 is enough, the worst impact of that oversleep is about 1/8/20 = ~0.6%.

I wonder if sleeping has some other impact, like making it more likely the process would move to another CPU socket, not the one having the PCIe lanes going to the right GPU, and how much impact that could have.
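
If that's the case, pinning each process to the socket local to its GPU should make a difference. A quick sketch (CPUs 0-7 are a placeholder for whatever cores nvidia-smi topo -m reports as local to the GPU in question):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Pin the current process to CPUs 0-7 (placeholder for the cores of
 * the socket whose PCIe lanes reach the GPU this process drives). */
static int pin_to_local_socket(void)
{
    cpu_set_t set;
    int cpu;

    CPU_ZERO(&set);
    for (cpu = 0; cpu < 8; cpu++)
        CPU_SET(cpu, &set);

    if (sched_setaffinity(0, sizeof(set), &set)) {
        perror("sched_setaffinity");
        return -1;
    }
    return 0;
}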

solardiz avatar Feb 08 '22 00:02 solardiz

I too had thoughts about CPU affinity - I think I can tweak that with MPI, will try that as well.

magnumripper avatar Feb 08 '22 00:02 magnumripper

Some more data points:

  • It sure looks like other formats are affected too.
  • mpirun defaults to --bind-to socket when the number of processes is >2. I tried --bind-to core and it did not seem to make any difference.
  • Running forked instead of MPI shows the same problem.
  • Running a single process STILL HAS the problem!? System otherwise idling. This is a mystery to me as that's not what I'm seeing on other systems. I wonder if the kernel version and/or nvidia driver version could be factors.
  • Your 1 ms patch doesn't help (as expected - oversleep doesn't seem to be the problem but the sleeping itself).

Basic info: 2x Xeon 4110 @ 2.10GHz, 16 cores + HT, 96 GB RAM, 8x 2080ti, nvidia 450.102.04, Linux 4.15.0-142-generic (Ubuntu 20.04, not recently updated)

magnumripper avatar Feb 09 '22 09:02 magnumripper

Our "super" is Linux 2.6.32-754.6.3.el6.x86_64, nvidia 418.39, 2x Xeon E5-2670 My Linux dev machine is Linux 5.8.0-19-generic, nvidia 510.47.03 (single i7-4790)

magnumripper avatar Feb 09 '22 10:02 magnumripper

Regardless of the outcome of this issue, we might want a way to disable the workarounds at run-time. In fact, it would make further research into the issue a lot easier.

magnumripper avatar Feb 09 '22 10:02 magnumripper

oversleep doesn't seem to be the problem but the sleeping itself

To confirm, maybe try a fixed sleep time of 1ms. Then even 1us.
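
E.g. a throwaway hack against the same spot as above, just to isolate the variable:

-               if (wait_sleep && us >= 1000) \
-                       usleep(us); \
+               if (wait_sleep) \
+                       usleep(1000); /* or even usleep(1) */ \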

want a way to disable the workarounds at run-time.

Definitely - that was always the plan, I just didn't get around to implementing it. We need a john.conf setting in the GPU section.
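
Something like this, say (the section and setting names below are placeholders - none of this exists yet):

# Placeholder sketch for john.conf - names are not real options yet
[Options:GPU]
# Set to N to disable the sleep-instead-of-busy-wait workaround on NVIDIA
NvidiaWaitSleep = Y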

solardiz avatar Feb 09 '22 15:02 solardiz

I should also test how hashcat's --spin-damp option behaves on that machine, but CUDA can't be installed there.

magnumripper avatar Feb 10 '22 10:02 magnumripper

Upgraded the kernel to 5.4.0-99-generic and the driver to 510.47.03; this did not help. Given I'm now on the same kernel & driver as my dev machine, which doesn't have this problem at all, the main suspect is definitely the dual CPU.

  • Running tezos-opencl against one of its test vectors for ten seconds shows a much smaller performance drop - about 2.5%.
  • Back to NT, a fixed sleep of 1 ms does not trigger the problem, nor does 10 ms (maybe a tad) - but at 20 ms the problem is grave. This is interesting given the autotune stats show the kernel duration should be about 45 ms.
  • Tried using taskset --cpu-list 0, no difference.
  • Tried the us - 1000 patch again, just to confirm it did not help.
  • Tried the same but using us - 10000 and even that did not help.
  • Does the amount of work somehow vary with the internal mask, so we trigger lots of oversleep? I added some statistics but it seems the variation in kernel duration is always less than 1.5 ms between consecutive calls. With this in mind it's totally weird that the us - 10000 fix did not help.
  • BTW the 45 ms duration indicated by autotune is shorter when actually cracking, around 35 ms. Even with this in mind, it's weird that a fixed sleep of 20 ms triggers the problem.
  • So I revisited that. Again, a fixed sleep of 10 ms is good, 20 ms is bad. I think it starts degrading at ~10 ms - it doesn't appear to be a sudden change at some certain sleep time (tried in 1 ms steps).
  • Tried two usleep calls, for 2x10 ms; it was just as bad as one of 20 ms.
  • I believe usleep is implemented using nanosleep, but just for the sake of it I tried using nanosleep instead (roughly as sketched below), no difference.
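
Roughly what that nanosleep swap looked like (a sketch, not the exact code I used):

#include <time.h>

/* Drop-in replacement for usleep(us) using nanosleep(). */
static void nsleep(unsigned long us)
{
    struct timespec req;

    req.tv_sec = us / 1000000;
    req.tv_nsec = (us % 1000000) * 1000;
    nanosleep(&req, NULL);
}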

So this is likely not about oversleeping at all, but something with the sleeping itself, and only on certain machines. I wonder if ~10 ms is a threshold for something, such as socket affinity, but that should be ruled out by my test with taskset, right? Also, why would it affect NT so much worse than others (such as tezos, with a similar sleep time)?

I'm out of ideas.

magnumripper avatar Feb 10 '22 17:02 magnumripper

Thinking out loud:

I would say Nvidia did what they did (busy-waiting) for some reason - after all, a "theoretical" fix seems easy.

Perhaps on some hardware it is hard to "sleep while waiting" if one wants to keep the maximum performance?

claudioandre-br avatar Feb 10 '22 19:02 claudioandre-br

Well, AMD doesn't have the problem. And even nvidia has no problem as long as you're using CUDA.

magnumripper avatar Feb 10 '22 19:02 magnumripper

I believe usleep is implemented using nanosleep, but just for the sake of it I tried using nanosleep instead, no difference.

You can also try select-based sleep.
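
That is, something like (a sketch):

#include <sys/select.h>

/* Sleep for "us" microseconds via select() with no file descriptors,
 * as an alternative to usleep()/nanosleep(). */
static void select_sleep(unsigned long us)
{
    struct timeval tv;

    tv.tv_sec = us / 1000000;
    tv.tv_usec = us % 1000000;
    select(0, NULL, NULL, NULL, &tv);
}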

And even nvidia has no problem as long as you're using CUDA.

This reminds me: you can also try hashcat in both OpenCL and CUDA mode on that machine. Would it perform worse with CUDA?

solardiz avatar Feb 10 '22 22:02 solardiz

Another experiment would be putting our own busy loop in there, tuned to consume e.g. 20ms, and see if the resulting nt-opencl performance is better, same, or worse than with usleep of the same duration.
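
Something along these lines, say (a sketch; it assumes john_get_nano() from opencl_common.h is in scope, as in the macros above):

#include <stdint.h>

/* Spin for approximately "us" microseconds without yielding the CPU,
 * to compare against usleep() of the same duration. */
static void busy_wait_us(uint64_t us)
{
    uint64_t end = john_get_nano() + us * 1000;

    while (john_get_nano() < end)
        ; /* spin */
}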

solardiz avatar Feb 10 '22 22:02 solardiz

Tried select-based sleep, no difference. As I had it readily available, I also tried my alternate sleep interrupted by callback, but it obviously had the same problem. I'm giving up on this until we get some new ideas. If we can't find a workaround before the next release, we should probably mention it (and the config setting for disabling the macros) in some docs.

magnumripper avatar Feb 16 '22 00:02 magnumripper