
Batch size begins to vary halfway through an epoch

MarcoForte opened this issue 8 months ago · 7 comments

🐛 Bug

Hello, I'm running into an issue where my batch size begins to vary halfway through an epoch.

To Reproduce

I logged every time the batch size deviated from 64. It happens in every epoch, and also when training on a single GPU.

[Screenshot, 2024-06-21 11:32: training log showing batch sizes deviating from 64 partway through the epoch]

Code sample

Unfortunately I can't share the code, but I will share as much as I can, and I can run many experiments. I launch training with `torchrun --standalone --nnodes=1 --nproc-per-node=8 main.py`. I use `sets = [StreamingDataset(a), StreamingDataset(b)]` and `DataLoader(CombinedStreamingDataset(datasets=sets))`, with `drop_last=True`, and launch the training through `trainer.fit`. A rough sketch of the pipeline is below.
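
Roughly, the data pipeline looks like the following sketch (the dataset paths, `num_workers`, and the use of litdata's `StreamingDataLoader` are illustrative; the real code differs in details I can't share):

```python
from litdata import CombinedStreamingDataset, StreamingDataLoader, StreamingDataset

# Two streaming datasets combined into one (paths are placeholders).
sets = [
    StreamingDataset(input_dir="s3://my-bucket/dataset_a"),
    StreamingDataset(input_dir="s3://my-bucket/dataset_b"),
]
combined = CombinedStreamingDataset(datasets=sets)

# batch_size=64 with drop_last=True, so every batch should contain exactly 64 samples.
loader = StreamingDataLoader(combined, batch_size=64, drop_last=True, num_workers=4)

# Minimal check that reproduces the symptom: report any batch whose size differs from 64.
for step, batch in enumerate(loader):
    if len(batch) != 64:
        print(f"step {step}: batch size {len(batch)}")
```

In the real run this loader is passed to `trainer.fit` under torchrun with 8 processes per node, but the deviation also shows up in the single-GPU case.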

Expected behavior

Fixed batch size throughout the epoch.

Environment

Using the NGC 23.05 container.

- Ubuntu 22.04, Python 3.10
- NVIDIA CUDA 12.4.1
- NVIDIA cuBLAS 12.4.5.8
- NVIDIA cuDNN 9.1.0.70
- NVIDIA NCCL 2.21.5
- lightning==2.3.0
- litdata==0.2.12
- 8 x H100

MarcoForte · Jun 21 '24