
Batch size beginning to vary halfway through epoch

Open MarcoForte opened this issue 1 year ago • 8 comments

🐛 Bug

Hello, I'm running into an issue where my batch size begins to vary halfway through an epoch.

To Reproduce

I logged whenever the batch size deviated from 64. It happens in every epoch, and also when training on a single GPU.

(Screenshot, 2024-06-21: training log showing batch sizes deviating from 64)
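
Roughly how I logged the check (a hypothetical sketch, not my actual module; it assumes the batch is a single tensor whose first dimension is the batch size):

import lightning as L

class MyModule(L.LightningModule):  # placeholder module, only the logging is relevant
    def training_step(self, batch, batch_idx):
        # log whenever the incoming batch is not the expected 64 samples
        if batch.shape[0] != 64:
            print(f"rank {self.global_rank}, batch {batch_idx}: got batch size {batch.shape[0]}")
        ...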

Code sample

Unfortunately I can't share the code, but I'll share as much as I can, and I can run many experiments. I launch training with torchrun --standalone --nnodes=1 --nproc-per-node=8 main.py. I build sets = [StreamingDataset(a), StreamingDataset(b)] and pass DataLoader(CombinedStreamingDataset(datasets=sets)), with drop_last=True, then start training through trainer.fit.
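
The setup looks roughly like this (a sketch; the dataset paths, num_workers, and the model are placeholders for what I can't share, while batch_size=64 and drop_last=True match my real config):

from torch.utils.data import DataLoader
from litdata import CombinedStreamingDataset, StreamingDataset

# placeholder dataset locations
sets = [StreamingDataset("s3://bucket/dataset_a"), StreamingDataset("s3://bucket/dataset_b")]
combined = CombinedStreamingDataset(datasets=sets)
train_loader = DataLoader(combined, batch_size=64, drop_last=True, num_workers=8)

# trainer.fit(model, train_dataloaders=train_loader) is then launched with:
#   torchrun --standalone --nnodes=1 --nproc-per-node=8 main.py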

Expected behavior

Fixed batch size throughout the epoch.

Environment

Using the NGC 23.05 container, with:

- Ubuntu 22.04
- Python 3.10
- NVIDIA CUDA 12.4.1
- NVIDIA cuBLAS 12.4.5.8
- NVIDIA cuDNN 9.1.0.70
- NVIDIA NCCL 2.21.5
- lightning==2.3.0
- litdata==0.2.12
- 8 x H100

MarcoForte avatar Jun 21 '24 15:06 MarcoForte

Hi! Thanks for your contribution, great first issue!

github-actions[bot] avatar Jun 21 '24 15:06 github-actions[bot]

Hey @MarcoForte. Fascinating, I have never seen this ;) Can you share a reproducible script with fake data? Does this issue still happen if you use a single StreamingDataset?
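
For reference, a fake-data script along these lines would be enough (a sketch; the output_dir, sample shape, and dataset size are arbitrary):

import numpy as np
from litdata import optimize, StreamingDataset, StreamingDataLoader

def make_sample(index):
    # tiny fake sample; the content doesn't matter for this test
    return {"index": index, "data": np.random.rand(8).astype(np.float32)}

if __name__ == "__main__":
    # 1. Write a small optimized dataset to disk
    optimize(fn=make_sample, inputs=list(range(10_000)), output_dir="fake_dataset", chunk_bytes="64MB")

    # 2. Stream it back and check every batch size
    dataset = StreamingDataset("fake_dataset")
    dataloader = StreamingDataLoader(dataset, batch_size=64, drop_last=True, num_workers=4)
    for batch in dataloader:
        assert batch["data"].shape[0] == 64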

tchaton avatar Jun 21 '24 18:06 tchaton

Cheers @tchaton, yeah it was a bit surprising 👀. I only noticed it because I was running in torch.compile mode, and the changing batch size kept triggering recompilations, causing a big slowdown. Otherwise it could easily go unnoticed... It also happened with a single StreamingDataset, bypassing the CombinedStreamingDataset.
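
For context, this is the kind of recompilation the varying batch size triggers (a standalone sketch, not my training code; assumes PyTorch 2.x):

import torch

torch._logging.set_logs(recompiles=True)  # log a message whenever Dynamo recompiles

model = torch.nn.Linear(16, 4)
compiled = torch.compile(model, dynamic=False)  # specialize graphs on input shapes

for bs in (64, 64, 63):  # the 63-sample batch forces a recompile
    compiled(torch.randn(bs, 16))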

If I find a moment I'll try to put together a reproducible script, thanks.

MarcoForte avatar Jun 21 '24 19:06 MarcoForte

Thanks a lot @MarcoForte. Looking forward to the code so I can debug it.

tchaton avatar Jun 22 '24 09:06 tchaton

Hey @MarcoForte, any chance you could provide a reproducible script?

tchaton avatar Jun 25 '24 17:06 tchaton

Hey @MarcoForte

Unfortunately, I can't reproduce this issue on my end.

import os
from lightning_cloud.utils.data_connection import add_s3_connection
from lightning.data import StreamingDataset, StreamingDataLoader
from lightning.data.streaming.serializers import JPEGSerializer
import torchvision.transforms.v2 as T
import open_clip
from tqdm import tqdm

# 1. Add the prepared dataset to your teamspace
add_s3_connection("laoin-400m")

# 2. Create the streaming dataset
class LAIONStreamingDataset(StreamingDataset):

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.tokenizer = open_clip.get_tokenizer('ViT-B-32', context_length=512) # You can use any tokenizer
        self.serializer = JPEGSerializer()
        self.preprocess = T.Normalize(mean=(0.48145466, 0.4578275, 0.40821073), std=(0.26862954, 0.26130258, 0.27577711))

    def __getitem__(self, index):
        _, image, text, _, _, _ = super().__getitem__(index)
        image = self.serializer.deserialize(image).float()
        return self.preprocess(image)

batch_size = 64

dataset = LAIONStreamingDataset(input_dir="/teamspace/s3_connections/laoin-400m")
# drop_last=True matches the reporter's setting and avoids a smaller final batch
dataloader = StreamingDataLoader(dataset, batch_size=batch_size, num_workers=os.cpu_count(), drop_last=True)

# 3. Check that every batch has the expected size
for batch in tqdm(dataloader):
    assert batch.shape[0] == batch_size

tchaton avatar Jun 27 '24 07:06 tchaton

Hey @MarcoForte. Any updates?

tchaton avatar Jul 12 '24 08:07 tchaton

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Apr 16 '25 05:04 stale[bot]