About VILADistributedSampler and gradient_accumulation_steps
If we use the VILADistributedSampler (https://github.com/Efficient-Large-Model/VILA/blob/main/llava/train/llava_trainer.py#L274-L281) for distributed training, should gradient_accumulation_steps be hardcoded to 1? I noticed that when I train on 4 nodes (8 GPUs per node) with gradient_accumulation_steps set to 8, training is very fast, but the speed seems abnormal to me.
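For context, this is the batch-size arithmetic I am assuming, following standard HuggingFace Trainer semantics (the per-device batch size below is just an illustrative placeholder, not a value from the repo):

```python
# Sketch of the effective global batch size under standard HF Trainer semantics.
# per_device_train_batch_size is an assumed example value.
num_nodes = 4
gpus_per_node = 8
world_size = num_nodes * gpus_per_node            # 32 processes in total

per_device_train_batch_size = 4                   # illustrative placeholder
gradient_accumulation_steps = 8

# Samples consumed per optimizer step across all ranks:
global_batch_size = (
    per_device_train_batch_size * world_size * gradient_accumulation_steps
)
print(global_batch_size)                          # 4 * 32 * 8 = 1024
```

So with gradient_accumulation_steps=8 the effective global batch per optimizer step is 8x larger than with gradient_accumulation_steps=1, which is why I want to confirm whether the custom sampler assumes accumulation of 1.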