[Bug] The sampler gives the wrong length for multi-node training
Hi,
First, thank you for the excellent work on this project!
I encountered an issue while fine-tuning the model on multiple nodes. Specifically, I observed the following behavior:
- When using 4 nodes with 1 gradient accumulation step, the total number of training steps is 4 times greater than when using 1 node with 4 gradient accumulation steps, even though the effective global batch size is the same (see the sketch below).
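For context, both configurations consume the same number of samples per optimizer step, so I would expect the same total step count. A rough back-of-the-envelope check (the dataset size and per-device batch size below are made-up placeholders, and I assume 1 GPU per node so that the node count equals the world size):

```python
import math

num_samples = 100_000          # hypothetical dataset size
per_device_batch_size = 8      # hypothetical per-device batch size

def optimizer_steps(world_size: int, grad_accum: int, epochs: int = 1) -> int:
    # Samples consumed per optimizer step = per-device batch * world size * accumulation steps
    global_batch = per_device_batch_size * world_size * grad_accum
    return math.ceil(num_samples / global_batch) * epochs

print(optimizer_steps(world_size=4, grad_accum=1))  # 4 nodes, 1 accumulation step
print(optimizer_steps(world_size=1, grad_accum=4))  # 1 node, 4 accumulation steps -> same value
```

In my runs, however, the two configurations do not report the same number of steps, which is what prompted me to look at the sampler.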
Upon examining the sampler code in the [LLaVA-NeXT Trainer implementation](https://github.com/LLaVA-VL/LLaVA-NeXT/blob/79ef45a6d8b89b92d7a8525f077c3a3a9894a87d/llava/train/llava_trainer.py#L273), I noticed something that seems off. Here's the relevant snippet:
```python
if self.args.group_by_length:
    lengths = self.train_dataset.lengths
    return LengthGroupedSampler(
        # self.args.train_batch_size * self.args.gradient_accumulation_steps,  # TODO: seems that we should not have gradient_accumulation_steps
        self.args.train_batch_size,
        # world_size=self.args.world_size,
        world_size=self.args.world_size * self.args.gradient_accumulation_steps,  # TODO: seems that this may work?
        lengths=lengths,
    )
```
It appears that the `world_size` passed to the sampler is multiplied by `gradient_accumulation_steps`, which may be causing unintended behavior: the step count grows when using multiple nodes because each node's share of the data is scaled incorrectly.
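To make the discrepancy concrete, this is roughly how I compared the per-rank dataloader length and the resulting optimizer-step count. This is only a sketch: it assumes `trainer` is the already-constructed trainer and relies on the standard Hugging Face `Trainer.get_train_dataloader()` method.

```python
import math

# Sketch only: `trainer` is assumed to be the constructed LLaVA trainer instance.
train_dataloader = trainer.get_train_dataloader()

batches_per_rank = len(train_dataloader)
steps_per_epoch = math.ceil(batches_per_rank / trainer.args.gradient_accumulation_steps)

print(f"per-rank batches: {batches_per_rank}, optimizer steps per epoch: {steps_per_epoch}")
```

This is where I see the roughly 4x difference described above between the 4-node / 1-accumulation and 1-node / 4-accumulation setups.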
Could you confirm whether this logic is intentional or suggest a solution to address this discrepancy?
Thank you for your time and help!