Batch size begins to vary halfway through an epoch
🐛 Bug
Hello, I'm running into an issue where my batch size begins to vary halfway through an epoch.
To Reproduce
I logged whenever the batch size deviated from 64. It happens in every epoch, and also when training on a single GPU. A rough sketch of the check I use is below.
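This is not the actual code, just an illustration of how the deviation is detected; the batch layout (a dict with an "x" tensor) and the print format are assumptions on my part.

```python
# Inside the LightningModule (illustrative only).
def training_step(self, batch, batch_idx):
    bs = batch["x"].shape[0]  # assumes the batch is a dict holding an "x" tensor
    if bs != 64:
        # log rank, step index and the unexpected batch size
        print(f"rank={self.global_rank} step={batch_idx} batch_size={bs}")
    ...
```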
Code sample
Unfortunately I can't share the code, but I will share as much as I can, and I can run many experiments.
I'm launching the training with torchrun --standalone --nnodes=1 --nproc-per-node=8 main.py
I use sets = [StreamingDataset(a), StreamingDataset(b)]
and DataLoader(CombinedStreamingDataset(datasets=sets)) with drop_last=True,
and I launch the training through trainer.fit. A minimal sketch of this setup is below.
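To make the report easier to follow, here is a minimal sketch of how the pieces fit together; the dataset paths, the placement of batch_size, the choice of StreamingDataLoader (vs. a plain torch DataLoader), and the model are placeholders, not the real code.

```python
import lightning as L
from litdata import StreamingDataset, CombinedStreamingDataset, StreamingDataLoader

# Hypothetical input dirs standing in for the real datasets a and b.
sets = [
    StreamingDataset("s3://bucket/dataset_a"),
    StreamingDataset("s3://bucket/dataset_b"),
]
combined = CombinedStreamingDataset(datasets=sets)

# batch_size=64 and drop_last=True as described above.
loader = StreamingDataLoader(combined, batch_size=64, drop_last=True, num_workers=4)

# Launched via: torchrun --standalone --nnodes=1 --nproc-per-node=8 main.py
trainer = L.Trainer(accelerator="gpu", devices=8, strategy="ddp")
trainer.fit(model, train_dataloaders=loader)  # `model` is the (unshared) LightningModule
```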
Expected behavior
A fixed batch size of 64 throughout the epoch, since drop_last=True is set.
Environment
Using the NGC 23.05 container:
- Ubuntu 22.04
- Python 3.10
- NVIDIA CUDA 12.4.1
- NVIDIA cuBLAS 12.4.5.8
- NVIDIA cuDNN 9.1.0.70
- NVIDIA NCCL 2.21.5
- lightning==2.3.0
- litdata==0.2.12
- 8 x H100