
Bug: Issues with Dataloader Batching Resulting in an Uneven Number of Batches and Streamed Items


🐛 Bug Report: Issues with Dataloader Batching

Title: Dataloader Producing an Uneven Number of Batches and Streamed Items

Description: The data loader outputs more batches than expected when num_workers > 1. In addition, the trailing batches contain an uneven number of samples, disregarding the specified batch size, so the total number of samples yielded over an epoch exceeds the number of samples actually available.
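For context, a rough back-of-the-envelope calculation (assuming the 100 samples are split evenly across the 4 workers, which is my guess at what happens internally) reproduces the observed 28 batches and the trailing 1-sample batches:

import math

num_samples, num_workers, batch_size = 100, 4, 4
per_worker = num_samples // num_workers                   # 25 samples per worker
batches_per_worker = math.ceil(per_worker / batch_size)   # 7 batches, the last with only 1 sample
total_batches = batches_per_worker * num_workers          # 28 batches in total (last batch idx 27)
print(per_worker, batches_per_worker, total_batches)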

To Reproduce

Create Optimized Dataset
from litdata import optimize


def random_data(index):
    # Each sample is simply its own index (0..99).
    return index


if __name__ == "__main__":
    optimize(
        fn=random_data,
        inputs=list(range(100)),            # 100 samples in total
        output_dir="my_optimized_dataset",
        num_workers=4,
        chunk_bytes="64MB",
    )
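(Optional) To confirm the dataset was written, you can list the output directory; it should contain the serialized chunk files plus an index describing them (exact file names may vary with the litdata version):

import os

print(sorted(os.listdir("my_optimized_dataset")))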

Run the following script to reproduce the behaviour


from litdata import StreamingDataset, StreamingDataLoader

dataset = StreamingDataset("my_optimized_dataset")
dataloader = StreamingDataLoader(dataset, num_workers=4, batch_size=4)

for batch_idx, batch in enumerate(dataloader):
    if batch_idx == 0:
        print("Epoch", dataloader.current_epoch)
    print(batch, end=" ")

print("\n\n\nlast batch idx:", batch_idx)
print("len dataloader:", len(dataloader))
print("\n\nLen of dataset:", len(dataset))
print("samples streamed:", dataloader._num_samples_yielded_streaming)

Output (the issue only seems to occur when num_workers > 1):

Epoch 1
tensor([0, 1, 2, 3]) tensor([25, 26, 27, 28]) tensor([50, 51, 52, 53]) tensor([75, 76, 77, 78]) tensor([4, 5, 6, 7]) tensor([29, 30, 31, 32]) tensor([54, 55, 56, 57]) tensor([79, 80, 81, 82]) tensor([ 8,  9, 10, 11]) tensor([33, 34, 35, 36]) tensor([58, 59, 60, 61]) tensor([83, 84, 85, 86]) tensor([12, 13, 14, 15]) tensor([37, 38, 39, 40]) tensor([62, 63, 64, 65]) tensor([87, 88, 89, 90]) tensor([16, 17, 18, 19]) tensor([41, 42, 43, 44]) tensor([66, 67, 68, 69]) tensor([91, 92, 93, 94]) tensor([20, 21, 22, 23]) tensor([45, 46, 47, 48]) tensor([70, 71, 72, 73]) 
tensor([95, 96, 97, 98]) tensor([24]) tensor([49]) tensor([74]) tensor([99]) 


last batch idx: 27
len dataloader: 25


Len of dataset: 100
samples streamed: 112 

Expected behavior

The data loader should produce a consistent number of batches, each adhering to the specified batch size, with the total number of samples matching the available dataset size.
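For reference, the expected numbers for this setup (100 samples, batch_size=4) would be:

import math

expected_batches = math.ceil(100 / 4)   # 25 batches, which is what len(dataloader) reports
expected_samples = 100                  # every sample yielded exactly once per epoch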

Environment

  • PyTorch Version (e.g., 1.0): 2.4.0
  • OS (e.g., Linux): macOS
  • How you installed PyTorch (conda, pip, source): pip
  • Build command you used (if compiling from source):
  • Python version: 3.12.4
  • CUDA/cuDNN version:
  • GPU models and configuration:
  • Any other relevant information:

Additional context

bhimrazy avatar Aug 11 '24 14:08 bhimrazy

I would assume this is expected behaviour in a concurrent setting, and it only affects the last batches. You can drop the incomplete trailing batches by passing drop_last=True:

dataloader = StreamingDataLoader(
    dataset, num_workers=4, batch_size=4, drop_last=True
)
Epoch 1
tensor([0, 1, 2, 3]) tensor([24, 25, 26, 27]) tensor([48, 49, 50, 51]) tensor([72, 73, 74, 75]) tensor([4, 5, 6, 7]) tensor([28, 29, 30, 31]) tensor([52, 53, 54, 55]) tensor([76, 77, 78, 79]) tensor([ 8,  9, 10, 11]) tensor([32, 33, 34, 35]) tensor([56, 57, 58, 59]) tensor([80, 81, 82, 83]) tensor([12, 13, 14, 15]) tensor([36, 37, 38, 39]) tensor([60, 61, 62, 63]) tensor([84, 85, 86, 87]) tensor([16, 17, 18, 19]) tensor([40, 41, 42, 43]) tensor([64, 65, 66, 67]) tensor([88, 89, 90, 91]) tensor([20, 21, 22, 23]) tensor([44, 45, 46, 47]) tensor([68, 69, 70, 71]) tensor([92, 93, 94, 95]) 


last batch idx: 23
len dataloader: 24


Len of dataset: 96
samples streamed: 96
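If I read this right, with drop_last=True each of the 4 workers drops its incomplete 1-sample batch, so 4 × 24 = 96 samples remain, split into 24 full batches.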

gluonfield avatar Aug 14 '24 15:08 gluonfield

Thank you, @AugustDev, for the insights and solution. Closing this issue, as the behaviour seems to be expected in a multi-worker setting.

cc: @tchaton

bhimrazy avatar Sep 03 '24 18:09 bhimrazy