
Batch size beginning to vary halfway through epoch

Open MarcoForte opened this issue 1 year ago • 8 comments

🐛 Bug

Hello, I'm running into an issue where my batch size begins to vary halfway through an epoch.

To Reproduce

I logged whenever the batch size deviated from 64. It happens in every epoch, and also when training on a single GPU.

(Screenshot, 2024-06-21: training log showing batch sizes deviating from 64)
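
Roughly how I logged the check (a hypothetical sketch, not my actual module; it assumes the batch is a single tensor whose first dimension is the batch size):

import lightning as L

class MyModule(L.LightningModule):  # placeholder module, only the logging is relevant
    def training_step(self, batch, batch_idx):
        # log whenever the incoming batch is not the expected 64 samples
        if batch.shape[0] != 64:
            print(f"rank {self.global_rank}, batch {batch_idx}: got batch size {batch.shape[0]}")
        ...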

Code sample

Unfortunately I can't share the code, but I'll share as much as I can, and I can run many experiments. I launch training with torchrun --standalone --nnodes=1 --nproc-per-node=8 main.py. I build sets = [StreamingDataset(a), StreamingDataset(b)] and pass DataLoader(CombinedStreamingDataset(datasets=sets)), with drop_last=True, then start training through trainer.fit.
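
The setup looks roughly like this (a sketch; the dataset paths, num_workers, and the model are placeholders for what I can't share, while batch_size=64 and drop_last=True match my real config):

from torch.utils.data import DataLoader
from litdata import CombinedStreamingDataset, StreamingDataset

# placeholder dataset locations
sets = [StreamingDataset("s3://bucket/dataset_a"), StreamingDataset("s3://bucket/dataset_b")]
combined = CombinedStreamingDataset(datasets=sets)
train_loader = DataLoader(combined, batch_size=64, drop_last=True, num_workers=8)

# trainer.fit(model, train_dataloaders=train_loader) is then launched with:
#   torchrun --standalone --nnodes=1 --nproc-per-node=8 main.py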

Expected behavior

Fixed batch size throughout the epoch.

Environment

Using the NGC 23.05 container, with:

- Ubuntu 22.04
- Python 3.10
- NVIDIA CUDA 12.4.1
- NVIDIA cuBLAS 12.4.5.8
- NVIDIA cuDNN 9.1.0.70
- NVIDIA NCCL 2.21.5
- lightning==2.3.0
- litdata==0.2.12
- 8 x H100

MarcoForte avatar Jun 21 '24 15:06 MarcoForte

Hi! Thanks for your contribution, great first issue!

github-actions[bot] avatar Jun 21 '24 15:06 github-actions[bot]

Hey @MarcoForte. Fascinating, I have never seen this ;) Can you share a reproducible script with fake data? Does this issue still happen if you use a single StreamingDataset?
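
For reference, a fake-data script along these lines would be enough (a sketch; the output_dir, sample shape, and dataset size are arbitrary):

import numpy as np
from litdata import optimize, StreamingDataset, StreamingDataLoader

def make_sample(index):
    # tiny fake sample; the content doesn't matter for this test
    return {"index": index, "data": np.random.rand(8).astype(np.float32)}

if __name__ == "__main__":
    # 1. Write a small optimized dataset to disk
    optimize(fn=make_sample, inputs=list(range(10_000)), output_dir="fake_dataset", chunk_bytes="64MB")

    # 2. Stream it back and check every batch size
    dataset = StreamingDataset("fake_dataset")
    dataloader = StreamingDataLoader(dataset, batch_size=64, drop_last=True, num_workers=4)
    for batch in dataloader:
        assert batch["data"].shape[0] == 64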

tchaton avatar Jun 21 '24 18:06 tchaton

Cheers @tchaton, yeah it was a bit surprising 👀. I only noticed it because I was running in torch.compile mode, and the changing batch size kept triggering recompilations, causing a big slowdown. Otherwise it could easily go unnoticed... It also happened with a single StreamingDataset, bypassing the CombinedStreamingDataset.
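
For context, this is the kind of recompilation the varying batch size triggers (a standalone sketch, not my training code; assumes PyTorch 2.x):

import torch

torch._logging.set_logs(recompiles=True)  # log a message whenever Dynamo recompiles

model = torch.nn.Linear(16, 4)
compiled = torch.compile(model, dynamic=False)  # specialize graphs on input shapes

for bs in (64, 64, 63):  # the 63-sample batch forces a recompile
    compiled(torch.randn(bs, 16))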

If I find a moment I'll try to put together a reproducible script, thanks.

MarcoForte avatar Jun 21 '24 19:06 MarcoForte

Thanks a lot @MarcoForte. Looking forward to the code so I can debug it.

tchaton avatar Jun 22 '24 09:06 tchaton

Hey @MarcoForte, any chance you could provide a reproducible script?

tchaton avatar Jun 25 '24 17:06 tchaton

Hey @MarcoForte

Unfortunately, I can't reproduce this issue on my end.

import os
from lightning_cloud.utils.data_connection import add_s3_connection
from lightning.data import StreamingDataset, StreamingDataLoader
from lightning.data.streaming.serializers import JPEGSerializer
import torchvision.transforms.v2 as T
import open_clip
from tqdm import tqdm

# 1. Add the prepared dataset to your teamspace
add_s3_connection("laoin-400m")

# 2. Create the streaming dataset
class LAIONStreamingDataset(StreamingDataset):

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.tokenizer = open_clip.get_tokenizer('ViT-B-32', context_length=512) # You can use any tokenizer
        self.serializer = JPEGSerializer()
        self.preprocess = T.Normalize(mean=(0.48145466, 0.4578275, 0.40821073), std=(0.26862954, 0.26130258, 0.27577711))

    def __getitem__(self, index):
        _, image, text, _, _, _ = super().__getitem__(index)
        image = self.serializer.deserialize(image).float()
        return self.preprocess(image)

batch_size = 64

dataset = LAIONStreamingDataset(input_dir="/teamspace/s3_connections/laoin-400m")
# drop_last=True matches the reporter's setting and avoids a smaller final batch
dataloader = StreamingDataLoader(dataset, batch_size=batch_size, num_workers=os.cpu_count(), drop_last=True)

# 3. Check that every batch has the expected size
for batch in tqdm(dataloader):
    assert batch.shape[0] == batch_size

tchaton avatar Jun 27 '24 07:06 tchaton

Hey @MarcoForte. Any updates?

tchaton avatar Jul 12 '24 08:07 tchaton

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Apr 16 '25 05:04 stale[bot]