Loss calculation across GPUs using all_gather_with_grad function

Open AlephZr opened this issue 1 year ago • 0 comments

The code uses the all_gather_with_grad function to gather tensors from all GPUs while preserving gradients, so that the contrastive loss can be computed across GPUs. I can train the BLIP-2 model successfully with this function. However, when I use it in my own model, training hangs after a certain number of iterations: no error is reported and training does not continue. GPU memory and RAM usage both look normal. Is there a detail I'm missing?
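For context, here is a minimal sketch of how a gradient-preserving all-gather is commonly written in PyTorch. It is not necessarily the exact code in this repository: the names `GatherLayerSketch` and `gather_with_grad_sketch` are hypothetical, and the single-process fallback is an assumption added so the snippet can run without a process group.

```python
import torch
import torch.distributed as dist


class GatherLayerSketch(torch.autograd.Function):
    """Hypothetical all-gather that keeps the autograd graph connected."""

    @staticmethod
    def forward(ctx, x):
        # Collective call: every rank must reach this line with the same shape.
        gathered = [torch.zeros_like(x) for _ in range(dist.get_world_size())]
        dist.all_gather(gathered, x)
        return tuple(gathered)

    @staticmethod
    def backward(ctx, *grads):
        # Also a collective: sum the gradients from all ranks, then keep
        # only the slice that corresponds to this rank's input.
        all_grads = torch.stack(grads)
        dist.all_reduce(all_grads)
        return all_grads[dist.get_rank()]


def gather_with_grad_sketch(x):
    """Concatenate x from all ranks along dim 0, preserving gradients."""
    # Assumption: fall back to the local tensor when not running distributed.
    if not (dist.is_available() and dist.is_initialized()):
        return x
    if dist.get_world_size() == 1:
        return x
    return torch.cat(GatherLayerSketch.apply(x), dim=0)
```

In a sketch like this, both the forward `all_gather` and the backward `all_reduce` are collective operations, so each rank has to call them the same number of times with matching tensor shapes; a rank that skips a step or passes a differently shaped tensor leaves the others waiting.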

Apr 25 '24 08:04 AlephZr