LAVIS
Loss calculation across GPUs using all_gather_with_grad function
The code uses the all_gather_with_grad function to gather tensors from all GPUs while keeping gradients attached, so that the contrastive loss can be computed across GPUs.
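For context, a minimal sketch of how such a gradient-preserving all-gather is typically implemented (LAVIS uses a similar custom autograd Function; the names and guards here are illustrative rather than copied from the repo):

```python
import torch
import torch.distributed as dist


class GatherLayer(torch.autograd.Function):
    """All-gather that keeps the autograd graph connected.

    torch.distributed.all_gather alone does not propagate gradients back to
    the source ranks, so the backward pass sums the gradients from every rank
    and hands each rank the slice that corresponds to its own input.
    """

    @staticmethod
    def forward(ctx, x):
        output = [torch.zeros_like(x) for _ in range(dist.get_world_size())]
        dist.all_gather(output, x)
        return tuple(output)

    @staticmethod
    def backward(ctx, *grads):
        # Sum gradients contributed by every rank's loss, then return the
        # portion belonging to this rank's original tensor.
        all_gradients = torch.stack(grads)
        dist.all_reduce(all_gradients)
        return all_gradients[dist.get_rank()]


def all_gather_with_grad(tensor):
    # Single-process (or non-distributed) fallback: nothing to gather.
    if not dist.is_available() or not dist.is_initialized() or dist.get_world_size() == 1:
        return tensor
    gathered = GatherLayer.apply(tensor)
    return torch.cat(gathered, dim=0)
```

Like any collective, both the forward all_gather and the backward all_reduce block until every rank reaches the call, so all processes must invoke it the same number of times per step.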
I can train the BLIP-2 model successfully with this function. But when I use it in my own model, training gets stuck after a certain number of iterations: it neither reports an error nor continues training, and GPU memory and RAM usage look normal.
Is there a detail I'm missing?