Jianfeng Wang
Is there any interface to avoid empty clusters?
In gradient accumulation, we do not need to synchronize the gradients for the first N - 1 iterations. With PyTorch DDP, we can use no_sync() as follows. In...
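A minimal sketch of this pattern, assuming `model` is already wrapped in `DistributedDataParallel` and `accum_steps` is the accumulation window N; the loss function and data-loader names are placeholders:

```python
import contextlib

import torch
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP

def train_epoch(model: DDP, optimizer, data_loader, accum_steps: int):
    optimizer.zero_grad()
    for i, (x, y) in enumerate(data_loader):
        # Skip the gradient all-reduce for the first N - 1 micro-batches;
        # gradients are only synchronized on the final backward of the window.
        is_last = (i + 1) % accum_steps == 0
        ctx = contextlib.nullcontext() if is_last else model.no_sync()
        with ctx:
            loss = F.cross_entropy(model(x), y)
            (loss / accum_steps).backward()
        if is_last:
            optimizer.step()
            optimizer.zero_grad()
```

Skipping the all-reduce on the intermediate backwards avoids N - 1 rounds of communication per effective batch; the locally accumulated gradients are reduced once at the end of the window.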
By default, 300 epochs are used for training. On a machine with 4 P100s, it takes about 21 days. Is this normal? How is the training time with V100...
I find that the proposed training strategy is to 1) train the backbone with the labels and the contrastive loss, and 2) fine-tune the last linear layer. The baseline approach is to train the...
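A minimal sketch of stage 2 as described above (fine-tuning only the last linear layer on top of a frozen backbone); the ResNet-50 architecture, feature width, and class count are illustrative assumptions, not details from the post:

```python
import torch
import torch.nn as nn
import torchvision

# Hypothetical stage-1 result: a backbone pretrained with labels + contrastive
# loss (weights would be loaded from the stage-1 checkpoint in practice).
backbone = torchvision.models.resnet50()
backbone.fc = nn.Identity()          # expose the 2048-d features
for p in backbone.parameters():
    p.requires_grad = False          # stage 2: freeze the backbone
backbone.eval()

classifier = nn.Linear(2048, 1000)   # only this layer is trained
optimizer = torch.optim.SGD(classifier.parameters(), lr=0.1, momentum=0.9)

x = torch.randn(8, 3, 224, 224)
y = torch.randint(0, 1000, (8,))
with torch.no_grad():
    feats = backbone(x)              # frozen features
loss = nn.functional.cross_entropy(classifier(feats), y)
loss.backward()
optimizer.step()
```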
If there are N GPUs, the snapshot will be N files for the optimizer states, each file corresponding to one GPU (let me know if this understanding is incorrect). Then,...
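A minimal sketch of that one-file-per-GPU layout, assuming one process per GPU with `torch.distributed` initialized; the directory layout and file names here are hypothetical:

```python
import torch
import torch.distributed as dist

def save_optimizer_shard(optimizer, out_dir="snapshot"):
    # Each rank writes only its own optimizer state, giving N files total.
    rank = dist.get_rank()
    torch.save(optimizer.state_dict(), f"{out_dir}/optimizer_rank{rank}.pt")

def load_optimizer_shard(optimizer, out_dir="snapshot"):
    # On resume, each rank loads back the file it wrote.
    rank = dist.get_rank()
    state = torch.load(f"{out_dir}/optimizer_rank{rank}.pt", map_location="cpu")
    optimizer.load_state_dict(state)
```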