CLIP4Clip
Accelerate sim_matrix computation in multi-GPU training
I edited two main things:

- Deleted the `loss.mean()` call, which does nothing: DDP already synchronizes gradients across GPUs automatically.
- Following this comment, https://github.com/openai/CLIP/issues/132#issuecomment-908004353, every similarity calculation is now done locally. This uses all negative samples from the global batch but only the positive samples from the local batch, so the local sim_matrix has shape (batch_size / n_gpu, batch_size).
- However, this raises another problem: the loss function assumes the positive (diagonal) elements sit in the first local-batch-size columns. When the local batch is not the first in the global batch, the correct positive samples actually sit in the column range local_rank * local_batch_size to (local_rank + 1) * local_batch_size. So I pass the second parameter of torch.diag(), which selects the diagonal starting at the first positive sample's column.
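The idea above can be sketched roughly as follows. This is a minimal illustration, not the repo's actual code: `local_contrastive_loss` is a hypothetical helper, and in real DDP training `global_feats` would come from a gradient-preserving all-gather (e.g. `torch.distributed.nn.all_gather`), which is omitted here.

```python
import torch
import torch.nn.functional as F

def local_contrastive_loss(local_feats, global_feats, local_rank):
    """Hypothetical sketch of the per-GPU loss described above.

    local_feats:  (b, d) features computed on this GPU (with gradients).
    global_feats: (B, d) features gathered from all GPUs, B = b * n_gpu.
    """
    b = local_feats.size(0)
    # Local sim_matrix of shape (b, B): negatives come from the global
    # batch, positives only from the local batch. The positives for this
    # GPU sit in columns [local_rank * b, (local_rank + 1) * b).
    sim_matrix = local_feats @ global_feats.t()
    logpt = F.log_softmax(sim_matrix, dim=-1)
    # torch.diag(x, k) with k > 0 reads the diagonal shifted k columns to
    # the right, i.e. elements x[i, i + k] -- exactly the shifted
    # positive-sample diagonal when k = local_rank * b.
    loss = -torch.diag(logpt, local_rank * b).mean()
    return loss
```

With `local_rank = 0` this reduces to the usual `torch.diag(logpt)` on the main diagonal, which is why single-GPU behavior is unchanged.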
In my experiments, the model converges as usual and training is more efficient.
This is also mentioned in this issue: https://github.com/ArrowLuo/CLIP4Clip/issues/101#issue-1607949547.