
RuntimeError: batch size must be positive when accum_steps > batch size in dynamic batching

Open Exiam6 opened this issue 4 months ago • 2 comments

Hi jianyuan, thanks for this amazing work!

When using gradient accumulation with dynamic batching, training crashes with `RuntimeError: batch size must be positive` whenever the dynamically computed batch size is smaller than `accum_steps`. I think there might be a bug at https://github.com/facebookresearch/vggt/blob/main/training/trainer.py#L823

The issue occurs in the interaction between dynamic batch sampling and gradient accumulation:

- The `DynamicBatchSampler` computes the batch size as `batch_size = max_img_per_gpu / random_image_num`.
- When `random_image_num` is large (e.g., 24), this results in a small batch size (e.g., 48 / 24 = 2).
- The `chunk_batch_for_accum_steps` function then tries to split this batch into `accum_steps` chunks.
- When `accum_steps=4` and `batch_size=2`, the chunking logic `2 // 4 = 0` produces empty batches.
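For illustration, here is a minimal sketch of the failure arithmetic. The real `chunk_batch_for_accum_steps` may differ in detail, and the tensor shapes below are made up, but the integer division is the problem:

```python
import torch

# Dynamic batching: batch_size = max_img_per_gpu // random_image_num = 48 // 24 = 2
images = torch.randn(2, 24, 3, 64, 64)  # (batch, frames, C, H, W); shapes are illustrative

accum_steps = 4
chunk_size = images.shape[0] // accum_steps  # 2 // 4 == 0
chunks = [images[i * chunk_size:(i + 1) * chunk_size] for i in range(accum_steps)]
print([c.shape[0] for c in chunks])  # [0, 0, 0, 0] -> every chunk is empty
# Feeding a zero-sized batch into scaled_dot_product_attention is what raises
# "RuntimeError: batch size must be positive".
```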

Reproduction Steps:

In `default.yaml`:

```yaml
accum_steps: 4
max_img_per_gpu: 48
```

Complete Traceback:

```
ERROR 2025-08-20 20:20:46,326 trainer.py: 421: Training failed with error: batch size must be positive
[rank1]: Traceback (most recent call last):
[rank1]:   File "/path/to/training/launch.py", line 31, in <module>
[rank1]:     main()
[rank1]:   File "/path/to/training/launch.py", line 27, in main
[rank1]:     trainer.run()
[rank1]:   File "/path/to/training/trainer.py", line 411, in run
[rank1]:     self.run_train()
[rank1]:   File "/path/to/training/trainer.py", line 433, in run_train
[rank1]:     self.train_epoch(dataloader)
[rank1]:   File "/path/to/training/trainer.py", line 616, in train_epoch
[rank1]:     self._run_steps_on_batch_chunks(
[rank1]:   File "/path/to/training/trainer.py", line 722, in _run_steps_on_batch_chunks
[rank1]:     loss_dict = self._step(
[rank1]:   File "/path/to/training/trainer.py", line 796, in _step
[rank1]:     y_hat = model(images=batch["images"])
[... rest of traceback ...]
[rank1]:   File "/path/to/vggt/layers/attention.py", line 61, in forward
[rank1]:     x = F.scaled_dot_product_attention(q, k, v, dropout_p=self.attn_drop.p if self.training else 0.0)
[rank1]: RuntimeError: batch size must be positive
```

Would you please confirm if there is a problem here? It should be easily fixable by setting a minimum batch size during chunk splitting. Thanks again for your contribution to the community!
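One possible fix, sketched below under the assumption that `chunk_batch_for_accum_steps` splits tensors along the batch dimension (I have not tested this against the actual trainer), is to cap the number of chunks at the batch size so that no chunk comes out empty:

```python
def chunk_batch_for_accum_steps(batch, accum_steps):
    # Sketch of a possible fix, not the actual vggt implementation:
    # never create more chunks than there are samples in the batch.
    batch_size = batch["images"].shape[0]
    effective_steps = max(1, min(accum_steps, batch_size))
    chunk_size = -(-batch_size // effective_steps)  # ceil division keeps chunks non-empty
    return [
        {k: v[i * chunk_size:(i + 1) * chunk_size] for k, v in batch.items()}
        for i in range(effective_steps)
        if i * chunk_size < batch_size
    ]
```

With `batch_size=2` and `accum_steps=4` this yields two chunks of one sample each instead of four empty ones. The trainer would of course also need to normalize the accumulated loss by the actual number of chunks rather than by `accum_steps`.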

Exiam6 avatar Aug 20 '25 21:08 Exiam6

Yes, this is expected. If you want to accumulate N steps, the batch size B should be a multiple of N.

jytime avatar Aug 21 '25 22:08 jytime

This helps me a lot! By the way, you can also set accum_steps=1 in default.yaml.

Kylesesa avatar Nov 25 '25 08:11 Kylesesa