About VILADistributedSampler and gradient_accumulation_steps
If we use the VILADistributedSampler (https://github.com/Efficient-Large-Model/VILA/blob/main/llava/train/llava_trainer.py#L274-L281) for distributed training, should gradient_accumulation_steps be hardcoded to 1? I noticed that when I train on 4 nodes (8 GPUs per node) with gradient_accumulation_steps set to 8, training is very fast, but the speed seems abnormal to me.
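For context, this is the batch-size arithmetic I am assuming, following standard HuggingFace Trainer semantics (the per-device batch size below is just an illustrative placeholder, not a value from the repo):

```python
# Sketch of the effective global batch size under standard HF Trainer semantics.
# per_device_train_batch_size is an assumed example value.
num_nodes = 4
gpus_per_node = 8
world_size = num_nodes * gpus_per_node            # 32 processes in total

per_device_train_batch_size = 4                   # illustrative placeholder
gradient_accumulation_steps = 8

# Samples consumed per optimizer step across all ranks:
global_batch_size = (
    per_device_train_batch_size * world_size * gradient_accumulation_steps
)
print(global_batch_size)                          # 4 * 32 * 8 = 1024
```

So with gradient_accumulation_steps=8 the effective global batch per optimizer step is 8x larger than with gradient_accumulation_steps=1, which is why I want to confirm whether the custom sampler assumes accumulation of 1.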