GPU utilization for one of four CUDA devices drops to 0% occasionally during huge batch dockings
Describe the bug
In the huge batch dockings we are running (~1-2 million ligands), the GPU utilization of each of our 4 NVIDIA GPUs starts out around 100%, then drops so that three GPUs sit at ~100% and one at 0%. Which GPU is inactive changes every time this occurs and will often change over time. For instance, GPU 2 could be at 0% utilization at 2 pm, but by 4 pm GPU 4 could be at 0% utilization, with GPU 2 back at ~100%. As far as I can tell this isn't thermal throttling, since all GPUs stay below 80 °C at no more than ~50% fan speed for the duration of the docking. It also isn't likely a CPU bottleneck, since our Xeon CPU never exceeds 25% utilization during the docking, and CPU temperatures hold at a constant 40-45 °C. The issue usually first becomes noticeable after around 200,000-300,000 compounds have been docked. No increase in failed dockings was noticeable, and no error messages appeared.
To Reproduce
Run a docking of any receptor with a large number of ligands (>250,000?) on a multi-GPU machine.
Expected behavior
All 4 GPUs running at nearly 100% for the duration of the docking. This issue did not occur when I previously ran docking projects as 4 separate processes that each used 1 GPU. I first noticed it occurring after I switched to a single docking process with the addition of `-D 2,3,4,5` (our GPU #1 is not CUDA-enabled).
Information to help narrow down the bug
- Which version of AutoDock-GPU are you using? v1.6 (develop)
- Which operating system are you on? Rocky Linux 9.5
- Which compiler, compiler version, and make compile options did you use? gcc 11.5.0, `make DEVICE=CUDA NUMWI=128`
- Which GPU(s) are you running on and is Cuda or OpenCL used? 4x NVIDIA RTX A5000 GPUs, CUDA
- Which driver version and if applicable, which Cuda version are you using? CUDA 12.8
- When compiling AutoDock-GPU, are `GPU_INCLUDE_PATH` and `GPU_LIBRARY_PATH` set? Are both environment variables set to the correct directories, i.e. corresponding to the correct Cuda version or OpenCL library? Yes, this was manually confirmed in the bin directories as well.
- Did this bug only show up recently? Which version of AutoDock-GPU, compiler, settings, etc. were you using that worked? This bug only showed up after I started running large dockings as one batch with the addition of `-D 2,3,4,5` to our command. It didn't occur noticeably when we ran the docking as 4 separate simultaneous batches that each used one GPU. Again, the inactive GPU rotates frequently, suggesting it isn't a hardware issue with one particular GPU. Unclear if this is related, but we are now storing docking files on a mechanical HDD because of the large file size. Although we initially ran dockings off of an SSD, I had previously tested the 4 separate batches on the HDD without noticeable throttling.
Even with this bug, AutoDock-GPU is still an insanely fast docking tool, and we love using it. Just hoping to contribute something useful back to the project with this report! Thank you.
@bmp192529 Thank you for reporting!
In your output, the device line for each ligand job tells you which GPU is being used (e.g. 2 / 4). Does the output stop using that particular GPU at some point? Did that GPU crash, or are there any errors listed at the end?
Also, does this occur with the current develop version? (I need to make a new release soon, as there are some nice fixes in there.)
Correction: I am using the develop version 1.6. Exactly which one, I will reply with when I have the machine in front of me, but it was installed 1-2 months ago at the latest. Would a crashed GPU restart itself? As I said, one GPU will intermittently drop to 0% utilization and then come back up later, with another GPU at 0%. In the output, you see many consecutive ligands that go "2/5, 3/5, 4/5, 2/5, 3/5, 4/5..." (GPU 5 not being used) and then later "2/5, 4/5, 5/5, 2/5, 4/5, 5/5..." (GPU 3 not being used). GPU 1, to clarify, is not CUDA-enabled and therefore never used. We have never seen more than 1 CUDA GPU go to 0% at a time, and there are often periods where all 4 GPUs are utilized again. As for errors, none pop up during the docking, and I don't recall ever noticing any at the end. We will have a docking complete within the next day or so, so I will keep an eye out for the message at the end and report back.
Strange behavior, and it might be something different entirely if the idle GPU changes ...
One thing I can think of here is the number of threads not being a multiple of the number of GPUs used - I don't know how many cores your CPU has, but maybe try `OMP_NUM_THREADS=16 autodock_gpu_128wi ...`. Also, is this a system with two separate CPU sockets? In some of those, the PCIe lanes going to each GPU go to one or the other socket, which could introduce wait times.
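As a sketch, something like this would show how the GPUs attach to the two sockets and fix the thread count (here `...` stands for the rest of your usual command line, and 16 threads is just a guess):

```
# Show how each GPU's PCIe lanes map to the CPU sockets / NUMA nodes
nvidia-smi topo -m

# Then run with a fixed thread count, e.g.
OMP_NUM_THREADS=16 autodock_gpu_128wi -D 2,3,4,5 ...
```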
We have two 24-core Xeon Silver CPUs. Whether that is 48 or 24 threads, it's still an even multiple of 4. I can try the `OMP_NUM_THREADS=16` option next time. I would think the wait time from the PCIe lanes crossing sockets should be less than a second, no? The inactive GPU typically stays in a constant state of inactivity for at least 30-60 minutes; it may even be longer, I just haven't been there to personally witness more than that. I also want to try running the batch as two smaller batches running simultaneously with two GPUs each, and see whether this problem occurs to any extent. Is there anything else I can try out and report back on to help diagnose the cause?
Hello,
I wanted to let you know that we updated to Cuda driver 12.9 and haven't noticed the issue since, at least from the random checks I've done on the computer's status during our most recent docking. I couldn't find any other reports online of people having this issue with the 12.8 driver, though. I'm trying to determine if there is a way to access a log of GPU utilization for the last few days, as that would be a much more definitive way to tell whether the issue is fully resolved. I will write back if I am able to determine anything more.
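For the next run, one thing I may try is simply logging the utilization myself with nvidia-smi, something along these lines (untested; the query fields and interval are just my first guess):

```
# Append a per-GPU utilization/temperature sample every 60 seconds
nvidia-smi --query-gpu=timestamp,index,name,utilization.gpu,temperature.gpu,fan.speed \
           --format=csv -l 60 >> gpu_util_log.csv
```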
Thanks again.
@bmp192529 That is great news! I haven't been able to reproduce this behavior on our machines so I am glad this may be outside AD-GPU :-)
Since we have run two large dockings since the CUDA driver update from 12.8 to 12.9 and not noticed the issue again, I will close this issue out. Thanks again for your time.
Hello,
We had to reinstall our CUDA drivers due to a failed kernel update, and since then the strange behavior of one GPU occasionally becoming inactive has come back, despite us still being on CUDA 12.9. This time it seems to be happening on one GPU more than the others. Should we contact our workstation manufacturer, or do you have any other thoughts on how it might be related to AutoDock-GPU and/or how we are running it on our workstation?
@bmp192529 AD-GPU's algorithm for each thread is to set up its ligand, wait until a GPU becomes available, dock, copy the data back, release the GPU, and do the processing (while another thread can already use the GPU). There are three outputs in that process: 1) Thread N is setting up job #i, 2) Running job #i, and 3) Thread N is processing job #i. Setup and processing are output immediately, while the docking output (as it has to accumulate) is only printed AFTER docking is finished. Based on that, and given that you're not seeing any errors and the docking eventually finishes, these are the possible causes I can think of for what you're seeing:
- maybe this is a case of "false negative", i.e. the 0% reading is simply not correct (there are reports online of this happening with Cuda >12.2); a possible fix is to make sure nvidia-smi matches the driver and Cuda version, and to compile with the correct compute capability (you could use `make TARGETS=86` for the RTX A5000 to force compilation of only that particular target); alternatively, `make DEVICE=OCLGPU` will compile for OpenCL, which may side-step some Cuda installation issues (w/o tensor cores, OpenCL is a bit faster; with them, the two are about equal) - see the command sketch after this list
- this could be a GPU going into thermal throttling - slowing down dramatically to cool down; you would see a thread setup message and then no docking output for a while, while all the other GPUs that are not overheated keep going full speed; this is consistent with the docking succeeding without errors, but you wrote this isn't what you observe
- AD-GPU could have a bug; here is the list I am thinking of there:
  - not releasing a GPU lock: that would mean a thread is hanging though, which means a docking that won't finish; not something we've seen
  - wrong GPU assignment (the additional, not-used GPU?): that would be clear from the start (which GPUs are set up) and isn't something we've seen; maybe you could use `-D all` with an OpenCL build though (if that GPU doesn't support Cuda, maybe it supports OpenCL)?
  - a thread hanging randomly or being really slow, which is an option with an SMP system (two separate CPUs); a potential thing to try here is `OMP_NUM_THREADS=24 OMP_PROC_BIND=true autodock_gpu_128wi ...` or other OpenMP-specific controls; I don't think the likelihood of this being it is high, but it's something to try
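Collected in one place, the things to try from the list above would look roughly like this (a sketch only - adjust the build flags to your setup and replace `...` with your usual docking arguments):

```
# Cuda build targeting only the A5000's compute capability (sm_86)
make DEVICE=CUDA NUMWI=128 TARGETS=86

# Alternative: OpenCL build, which side-steps possible Cuda installation issues
make DEVICE=OCLGPU NUMWI=128

# Run with a fixed number of pinned OpenMP threads
OMP_NUM_THREADS=24 OMP_PROC_BIND=true autodock_gpu_128wi -D 2,3,4,5 ...
```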
Last but not least, I am still stumbling over the "not Cuda-enabled GPU" - if it shows up in nvidia-smi then it should be capable, and if it doesn't, then AD-GPU would also not see it, so `-D all` should be safe ... If it's that different a GPU, maybe it trips up the driver (presuming it is an Nvidia GPU) ... So if possible, one other thing to try is to take it out.
Hello,
Sorry for the late reply:
- This isn't a false negative, since reading the outputs from AutoDock-GPU shows a clear absence of one GPU being used for the docking (the line showing the device used will show the same 3 of 4 GPUs over and over, matching the system monitor utilization and the nvidia-smi output).
- `-D all` doesn't work (that was the first thing I tried). I believe it is because the GPUs were configured by our workstation's manufacturer so that one GPU is set to "PROHIBITED" compute mode (to function as a traditional graphics card only), while the other 4 are set to "EXCLUSIVE_PROCESS" mode (see the command sketch after this list). I probably could change the settings or use OpenCL to get around this, but I don't really see the need, since that card is less powerful than the other 4 anyway.
- Unless the thermal monitors are wrong, it isn't thermal throttling. The inactive GPU cools all the way down to idle temperature and just stays there, without any temperature peaks and valleys. It also doesn't look like throttling in terms of output; it's more a complete absence of that GPU doing anything at all for a while, whereas a throttled GPU would still do some work. Also, this workstation has very powerful fans that keep the GPUs in the 70-80 °C range pretty consistently at only around 50% power.
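For reference, this is roughly how the compute modes can be checked (and, hypothetically, changed) on our cards; the GPU index in the second command is only an example:

```
# Show the compute mode of each GPU (DEFAULT / PROHIBITED / EXCLUSIVE_PROCESS)
nvidia-smi --query-gpu=index,name,compute_mode --format=csv

# Hypothetical example: switch GPU 0 out of PROHIBITED mode (needs root)
sudo nvidia-smi -i 0 -c DEFAULT
```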
The issue is really intermittent, so I don't know what the trigger could be. I'm not experienced with C++ myself, but I was trying to read through the code for potential issues and couldn't find anything plausible. The point about the GPU lock was something I considered too, but I didn't see anywhere a problem could arise. The code is very readable, by the way, so thank you for that.
If I have a chance, I'll set up a reasonably sized batch and run it two different ways: once as one instance of AutoDock-GPU using all 4 GPU devices, and once as 4 separate instances of AutoDock-GPU, each with one GPU. If the issue is thermal throttling, the total times should be almost exactly the same, while if the issue is some bug in how AutoDock-GPU queues multiple devices, the single instance with all 4 GPUs should be significantly slower. I think this may help narrow down the problem; let me know if you have any suggestions.
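Roughly what I have in mind (`...` stands for our usual input arguments; for run B each instance would get a quarter of the ligand list):

```
# Run A: one instance driving all four CUDA GPUs
time autodock_gpu_128wi -D 2,3,4,5 ...

# Run B: four instances, one GPU each
start=$(date +%s)
autodock_gpu_128wi -D 2 ... &
autodock_gpu_128wi -D 3 ... &
autodock_gpu_128wi -D 4 ... &
autodock_gpu_128wi -D 5 ... &
wait
echo "Run B wall time: $(( $(date +%s) - start )) s"
```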
Thanks again for all your help.