
GPU lost

Open · elricwan opened this issue on Sep 21 '22 · 3 comments

Hi there,

In some experiments, I run into a situation where one GPU is lost during training, and I have to restart the whole run. Have you ever encountered this issue? Thank you.

elricwan · Sep 21 '22 16:09

Hi! That's strange. What does the trainer on that GPU print? Is the trainer process on that GPU down, or still running on its own? Does the lost peer still see the others?

> I have to restart the whole run.

Sanity check: if you're using hivemind.Optimizer, you should be able to restart only the peer whose GPU was lost, not all of them.
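Roughly, relaunching only the crashed peer could look like the sketch below (based on the hivemind quickstart pattern; the multiaddress, run_id, model, and batch sizes are placeholders rather than values from this issue):

```python
# Sketch: relaunch ONLY the peer whose GPU died, based on the hivemind quickstart.
# The multiaddress, run_id, model, and batch sizes below are placeholders.
import torch
import torch.nn as nn
import hivemind

model = nn.Linear(512, 10)  # stand-in for the real model
base_opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Join the DHT through the peer that is still running.
dht = hivemind.DHT(
    initial_peers=["/ip4/192.0.2.1/tcp/31337/p2p/<surviving-peer-id>"],  # placeholder multiaddress
    start=True,
)

opt = hivemind.Optimizer(
    dht=dht,
    run_id="my_run",            # must match the run_id used by the surviving peer
    batch_size_per_step=32,
    target_batch_size=4096,
    optimizer=base_opt,
    matchmaking_time=3.0,
    averaging_timeout=10.0,
    verbose=True,
)

# Download the latest model/optimizer state from the surviving peer, then resume training.
opt.load_state_from_peers()
```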

justheuristic · Sep 21 '22 17:09

Thank you for the quick response. I ran the experiments on one machine with two GPUs, so when one GPU went down, I had to restart the machine to maintain performance. Here is a figure of the training process:

[Screenshot: training progress (Screen Shot 2022-09-21 at 1.45.26 PM)]

The lost GPU can no longer sync during training, and when I run nvidia-smi, it reports that the GPU is not detected.

elricwan · Sep 21 '22 17:09

Hi! Since nvidia-smi says the GPU is not detected, that typically means there is a hardware or driver issue.

For instance, the last time I saw this behavior, in late summer, I found this line in /var/log/syslog for that day:

Aug 29 16:58:38 my-devbox-gpu kernel: [14282.651831] NVRM: Xid (PCI:0000:03:00): 79, 
pid='<unknown>', name=<unknown>, GPU has fallen off the bus.

In my case, the issue was a cheap PCIe riser. In other cases, it might be a problem with the motherboard, the GPU, the way it is connected, or driver compatibility. Or you may find a different error altogether.
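As a quick sanity check, independent of hivemind, a small probe like the following can show which CUDA device the trainer process can still reach (plain torch.cuda calls; a sketch, not part of the original setup):

```python
# Probe each CUDA device with a tiny allocation; a GPU that has fallen off the bus
# typically raises a RuntimeError here (or vanishes from device_count entirely).
import torch

print("visible CUDA devices:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    try:
        torch.ones(1, device=f"cuda:{i}")
        print(f"cuda:{i} ({torch.cuda.get_device_name(i)}) is responding")
    except RuntimeError as err:
        print(f"cuda:{i} is NOT responding: {err}")
```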

justheuristic · Sep 22 '22 13:09

Hi, thank you for the explanation. To be clearer about my case, I ran the code again and reproduced the error. When I run nvidia-smi, I get:

Unable to determine the device handle for GPU 0000:01:00.0: Unknown Error

On that peer, the log shows:

Sep 22 11:18:13.105 [ERROR] [hivemind.averaging.averager._step:486] Averaging step failed: could not find a group
Traceback (most recent call last):
  File "/home/protago/miniconda3/envs/hivemind/lib/python3.8/site-packages/hivemind/averaging/averager.py", line 459, in _step
    raise AllreduceException("Averaging step failed: could not find a group")
hivemind.averaging.partition.AllreduceException: Averaging step failed: could not find a group

When the error occurs, I notice that the loss stops decreasing; the figure looks like this:

[Screenshot: loss curve after the error (Screen Shot 2022-09-22 at 11.26.17 AM)]

The hivemind version I use is 1.0.1. The code I run is a ViT example on top of the hivemind framework; I attach it below. (The data I use is ImageNet; to run the code, you just need to put the ImageNet dataset in the local folder.) By the way, if the code works fine, maybe we could add a vision example to hivemind in the future?

vit_hivemind.zip
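For reference, a rough sketch of what such a vision example could look like, combining a torchvision ViT with the quickstart-style hivemind.Optimizer wrapper (the dataset path, run_id, and hyperparameters are illustrative and not taken from the attached vit_hivemind.zip):

```python
# Rough sketch of a ViT-on-ImageNet peer using hivemind (quickstart-style wrapper).
# Dataset path, batch sizes, and run_id are illustrative placeholders.
import torch
import hivemind
from torchvision import datasets, transforms, models

transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
dataset = datasets.ImageFolder("./imagenet/train", transform=transform)
loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True, num_workers=4)

device = "cuda"
model = models.vit_b_16().to(device)          # ViT-B/16, 1000 ImageNet classes by default
base_opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

dht = hivemind.DHT(start=True)  # a second peer would pass initial_peers=dht.get_visible_maddrs()
opt = hivemind.Optimizer(
    dht=dht,
    run_id="vit_imagenet_run",
    batch_size_per_step=32,
    target_batch_size=4096,
    optimizer=base_opt,
    matchmaking_time=5.0,
    averaging_timeout=30.0,
    verbose=True,
)

criterion = torch.nn.CrossEntropyLoss()
for images, labels in loader:
    images, labels = images.to(device), labels.to(device)
    loss = criterion(model(images), labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
```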

elricwan · Sep 22 '22 15:09

> Unable to determine the device handle for GPU 0000:01:00.0: Unknown Error

To be frank: we won't be able to help you with this problem. It could be either a driver issue (in which case, the solution is to switch to a different driver by trial and error) or a hardware issue (in which case you'll have to swap GPUs or try a different slot on your motherboard).

> The hivemind version I use is 1.0.1.

We had a problem earlier where some settings (offload_optimizer=True) would not train with exactly one peer.
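If the crash leaves exactly one peer alive, it may be worth checking whether that earlier problem applies. A minimal sketch of running with offloading disabled (placeholder model and settings; an assumption to test, not a confirmed fix):

```python
# Sketch: same quickstart-style wrapper, but with optimizer offloading disabled,
# to rule out the single-peer issue mentioned above. Model and settings are placeholders.
import torch
import hivemind

model = torch.nn.Linear(512, 10)
dht = hivemind.DHT(start=True)

opt = hivemind.Optimizer(
    dht=dht,
    run_id="my_run",
    batch_size_per_step=32,
    target_batch_size=4096,
    optimizer=torch.optim.Adam(model.parameters(), lr=1e-3),
    offload_optimizer=False,   # the setting that was problematic with exactly one peer when True
    verbose=True,
)
```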

> By the way, if the code works fine, maybe we could add a vision example to hivemind in the future?

Absolutely!

justheuristic · Sep 26 '22 10:09

I see, thank you for your help!

elricwan · Sep 26 '22 15:09