CLIP4Clip
Accelerate sim_matrix computation in multi-GPU training
I edited two main things:

- Deleted the `loss.mean()` call, which does nothing: DDP already synchronizes gradients across GPUs automatically.
- Following this comment, https://github.com/openai/CLIP/issues/132#issuecomment-908004353, every similarity calculation is now done locally. This uses all negative samples from the global batch but only the positive samples from the local batch, so the local sim_matrix has shape (batch_size / n_gpu, batch_size).
- However, this raises another problem: the loss function assumes the positive (diagonal) elements sit in the first local-batch-size columns. When the local batch is not the first in the global batch, the correct positive samples actually sit in the column range local_rank * local_batch_size to (local_rank + 1) * local_batch_size. So I pass the second parameter of torch.diag(), which selects the diagonal starting at the first positive sample's column.
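The idea above can be sketched roughly as follows. This is a minimal illustration, not the repo's actual code: `local_contrastive_loss` is a hypothetical helper, and in real DDP training `global_feats` would come from a gradient-preserving all-gather (e.g. `torch.distributed.nn.all_gather`), which is omitted here.

```python
import torch
import torch.nn.functional as F

def local_contrastive_loss(local_feats, global_feats, local_rank):
    """Hypothetical sketch of the per-GPU loss described above.

    local_feats:  (b, d) features computed on this GPU (with gradients).
    global_feats: (B, d) features gathered from all GPUs, B = b * n_gpu.
    """
    b = local_feats.size(0)
    # Local sim_matrix of shape (b, B): negatives come from the global
    # batch, positives only from the local batch. The positives for this
    # GPU sit in columns [local_rank * b, (local_rank + 1) * b).
    sim_matrix = local_feats @ global_feats.t()
    logpt = F.log_softmax(sim_matrix, dim=-1)
    # torch.diag(x, k) with k > 0 reads the diagonal shifted k columns to
    # the right, i.e. elements x[i, i + k] -- exactly the shifted
    # positive-sample diagonal when k = local_rank * b.
    loss = -torch.diag(logpt, local_rank * b).mean()
    return loss
```

With `local_rank = 0` this reduces to the usual `torch.diag(logpt)` on the main diagonal, which is why single-GPU behavior is unchanged.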
In my experiments, the model converges as usual and training is more efficient.
This is also mentioned in this issue: https://github.com/ArrowLuo/CLIP4Clip/issues/101#issue-1607949547.