NT-opencl performance regression on dual socket rig after the "avoid nvidia busy-wait" commit
I had a report of a significant regression, from ~25 Gp/s to more like 17 Gp/s per GPU, running incremental plus -mask:?w?a?a under MPI on a GPU rig with 8x 2080ti (and more CPU cores than GPU cards). CPU utilization went from ~86% to ~61%.
I was granted enough access to the rig to reproduce the problem, and to confirm that only reverting 7f6aed9de fixes it completely.
Not sure when/how much more I can experiment with that very rig. My single-GPU tests show nothing of the sort, so I'm puzzled as to why it would happen with more processes and GPUs. Perhaps 3x fork would show it, so some tests on "super" (with the nvidias not simultaneously used for other things) are needed.
Meanwhile perhaps we should consider reverting that commit?
@magnumripper Can you test on that rig whether the regression is in fact specific to this format, or would also be seen with another format we treated similarly? Reverting for just one format isn't such a good idea if the problem isn't format-specific.
I believe it's only NT (and that format is kinda special with its high speed and short durations). I'll see if I can get more data.
Maybe we need to actually increase the 1ms threshold? Also, the 2.4% and 1.2% figures become rather low in absolute terms when we're near that threshold. While we could make code changes to ensure these are no lower than some number of microseconds, simply increasing the sleep threshold e.g. from 1ms to 10ms would avoid using ridiculously low figures there as well (they would still be computed, just never used).
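In its most minimal form that would just be raising the constant in the existing threshold check (us there is in microseconds), i.e. something like:

-	if (wait_sleep && us >= 1000) \
+	if (wait_sleep && us >= 10000) \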
What's the kernel duration for nt-opencl on 2080Ti with that mask and I guess certain autotuned LWS/GWS?
Maybe this is both a good approach and a trivial enough change:
+++ b/src/opencl_common.h
@@ -342,8 +342,8 @@ void opencl_process_event(void);
if (gpu_nvidia(device_info[gpu_id])) { \
wait_start = john_get_nano(); \
uint64_t us = wait_min >> 10; /* 2.4% less than min */ \
- if (wait_sleep && us >= 1000) \
- usleep(us); \
+ if (wait_sleep && us >= 2000) \
+ usleep(us - 1000); \
}
#define WAIT_UPDATE \
if (gpu_nvidia(device_info[gpu_id])) { \
This leaves a fixed 1ms allowance for usleep possibly sleeping longer than intended.
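To put rough numbers on it (assuming wait_min is tracked in nanoseconds, which the >> 10 ns-to-us conversion suggests): for a ~45 ms minimum kernel duration, wait_min ≈ 45,000,000 ns and us = 45,000,000 >> 10 ≈ 43,945 µs, i.e. only about 1 ms below the minimum, so an oversleep of little more than 1 ms already overshoots the kernel. With the patch we'd sleep 42,945 µs instead, keeping roughly 2 ms of total slack.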
We're talking about e.g. 45 ms kernel duration, so it's actually not extremely short... I think the first thing I'll do is really ensure only NT is affected. Also, I need to test whether --fork=8 has the same problem as MPIx8 (I just assumed so).
During my work on #5006 I had some temporary code in the macros that reliably detected and warned about oversleep. I need to find or rewrite that, and possibly commit it as an alternative (e.g. for -DDEBUG).
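Something along these lines (a minimal sketch written from scratch, not the actual #5006 code; the 10% + 100 µs margin is arbitrary):

/* Oversleep detector sketch: wrap the sleep, time it with a monotonic
 * clock, and warn when we slept noticeably longer than requested. */
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>

static uint64_t nano_now(void)
{
	struct timespec ts;
	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

static void usleep_checked(unsigned int us)
{
	uint64_t start = nano_now();
	usleep(us);
	uint64_t slept = (nano_now() - start) / 1000; /* actual sleep, in us */
	if (slept > us + us / 10 + 100) /* more than 10% + 100 us over */
		fprintf(stderr, "Oversleep: asked for %u us, got %llu us\n",
		        us, (unsigned long long)slept);
}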
I think an oversleep wouldn't result in "from ~25 Gp/s to more like 17 Gp/s" anyway, because it'd reduce the sleep duration to 7/8 (and then again, and again if necessary) and keep that lower duration for almost another 20 kernel invocations. So if one 7/8 reduction is enough, the worst impact of that oversleep is about (1/8)/20 ≈ 0.6%.
I wonder if sleeping has some other impact, like making it more likely the process would move to another CPU socket, not the one having the PCIe lanes going to the right GPU, and how much impact that could have.
I too had thoughts about CPU affinity - I think I can tweak that with MPI, will try that as well.
Some more data points:
- It sure looks like other formats are affected too.
- mpirun defaults to --bind-to socket when the number of processes is >2. I tried --bind-to core and it did not seem to make any difference.
- Running forked instead of MPI shows the same problem.
- Running a single process STILL HAS the problem!? System otherwise idling. This is a mystery to me as that's not what I'm seeing on other systems. I wonder if the kernel version and/or nvidia driver version could be factors.
- Your 1 ms patch doesn't help (as expected - oversleep doesn't seem to be the problem but the sleeping itself).
Basic info: 2x Xeon 4110 @ 2.10GHz, 16 cores + HT, 96 GB RAM, 8x 2080ti, nvidia 450.102.04, Linux 4.15.0-142-generic (Ubuntu 20.04, not recently updated).
Our "super" is Linux 2.6.32-754.6.3.el6.x86_64, nvidia 418.39, 2x Xeon E5-2670. My Linux dev machine is Linux 5.8.0-19-generic, nvidia 510.47.03 (single i7-4790).
Regardless of the outcome of this issue, we might want a way to disable the workarounds at run-time. In fact, it would make further researching of the issue a lot easier.
oversleep doesn't seem to be the problem but the sleeping itself
To confirm, maybe try a fixed sleep time of 1ms. Then even 1us.
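E.g. by temporarily hardwiring the sleep in the macro patched above (a throwaway test hack, assuming the macro shape shown in that diff):

	if (gpu_nvidia(device_info[gpu_id])) { \
		wait_start = john_get_nano(); \
		if (wait_sleep) \
			usleep(1000); /* fixed 1 ms for testing; then try 1 for 1 us */ \
	}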
want a way to disable the workarounds at run-time.
Definitely - that was always the plan, I just didn't get around to implementing it. We need a john.conf setting in the GPU section.
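Something like this, perhaps (just a sketch; it assumes john's existing cfg_get_bool() helper, and the section/parameter names are placeholders, not a final proposal):

/* Read the toggle from john.conf once and cache it. */
#include "config.h" /* for cfg_get_bool() and SECTION_OPTIONS */

static int nv_sleep_cfg = -1; /* -1 = not read yet */

static int nv_sleep_enabled(void)
{
	if (nv_sleep_cfg < 0)
		nv_sleep_cfg = cfg_get_bool(SECTION_OPTIONS, NULL,
		                            "NvidiaSleepWorkaround", 1);
	return nv_sleep_cfg;
}

The nvidia-specific branches of the wait macros would then check nv_sleep_enabled() in addition to gpu_nvidia(device_info[gpu_id]).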
I should also test how hashcat's --spin-damp option behaves on that machine but CUDA can't be installed then.
Upgraded the kernel to 5.4.0-99-generic and the driver to 510.47.03; this did not help. Given I'm now on the same kernel & driver as my dev machine, which doesn't have this problem at all, the main suspect is definitely the dual CPU.
- tezos-opencl against one of its test vectors for ten seconds shows a much smaller performance drop, about 2.5%.
- Back to NT: a fixed sleep of 1 ms does not trigger the problem, and neither does 10 ms (maybe a tad) - but at 20 ms, the problem is grave. This is interesting given that the autotune stats show the kernel duration should be about 45 ms.
- Tried using taskset --cpu-list 0, no difference.
- Tried the us - 1000 patch again, just to confirm it did not help.
- Tried the same but using us - 10000, and even that did not help.
- Does the amount of work somehow vary with the internal mask, so that we trigger lots of oversleep? I added some statistics, but it seems the variations in kernel duration are always less than 1.5 ms between consecutive calls. With this in mind it's totally weird that the us - 10000 fix did not help.
- BTW the 45 ms duration indicated by autotune is shorter when actually cracking, around 35 ms. Even with this in mind, it's weird that a fixed sleep of 20 ms triggers the problem.
- So I visited that again. Again, a fixed sleep of 10 ms is good, 20 ms is bad. I think it starts degrading at ~10 ms; it doesn't appear to be a sudden change at some specific sleep time (tried in 1 ms steps).
- Tried two usleep calls, for 2x10 ms, it was just as bad as one of 20 ms.
- I believe usleep is implemented using nanosleep, but just for the sake of it I tried using nanosleep instead, no difference.
So this is likely not about oversleeping at all, but something with the sleeping itself, and only on certain machines. I wonder if ~10 ms is a threshold for something, such as socket affinity, but that should be ruled out by my test with taskset, right? Also, why would it affect NT so much worse than others (such as tezos, with a similar sleep time)?
I'm out of ideas.
Thinking out loud:
Since a "theoretical" fix seems so easy, I would say Nvidia did what they did for a reason.
Perhaps on some hardware it is hard to "sleep while waiting" if one wants to keep the maximum performance?
Well, AMD doesn't have the problem. And even nvidia have no problem as long as you're using CUDA.
I believe usleep is implemented using nanosleep, but just for the sake of it I tried using nanosleep instead, no difference.
You can also try select-based sleep.
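For reference, a select()-based sleep would be something like this (minimal sketch):

/* Sleep for "us" microseconds via select() with no fd sets - a classic
 * portable sleep that takes a different path than usleep()/nanosleep(). */
#include <stdint.h>
#include <sys/select.h>

static void select_sleep_us(uint64_t us)
{
	struct timeval tv;
	tv.tv_sec = us / 1000000;
	tv.tv_usec = us % 1000000;
	select(0, NULL, NULL, NULL, &tv); /* ignoring EINTR for this test */
}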
And even nvidia have no problem as long as you're using CUDA.
This reminds me: you can also try hashcat in both OpenCL and CUDA mode on that machine. Would it perform worse with CUDA?
Another experiment would be putting our own busy loop in there, tuned to consume e.g. 20ms, and see if the resulting nt-opencl performance is better, same, or worse than with usleep of the same duration.
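A minimal version of such a loop, spinning on a monotonic clock until the target time has passed (a sketch; it assumes john_get_nano() returns nanoseconds, as its use in the wait macros suggests):

#include <stdint.h>

extern uint64_t john_get_nano(void); /* assumed signature, per its use above */

/* Deliberate busy-wait for "us" microseconds - burns CPU on purpose,
 * to compare against usleep() of the same duration. */
static void busy_wait_us(uint64_t us)
{
	uint64_t deadline = john_get_nano() + us * 1000;
	while (john_get_nano() < deadline)
		; /* spin */
}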
Tried select-based sleep, no difference. As I had it readily available, I also tried my alternate sleep interrupted by callback, but it obviously had the same problem. I'm giving up on this until we get some new ideas. If we can't find a workaround before the next release, we should probably mention it (and the config setting for disabling the macros) in some docs.