
Bug: Issues with Dataloader Batching Resulting in an Uneven Number of Batches and Streamed Items


🐛 Bug Report: Issues with Dataloader Batching

Title: Dataloader Producing an Uneven Number of Batches and Streamed Items

Description: The data loader outputs more batches than expected when num_workers > 1. In addition, the trailing batches contain an uneven number of samples, disregarding the specified batch size, so the total number of samples yielded over an epoch exceeds the number of samples actually available.
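For context, a rough back-of-the-envelope calculation (assuming the 100 samples are split evenly across the 4 workers, which is my guess at what happens internally) reproduces the observed 28 batches and the trailing 1-sample batches:

import math

num_samples, num_workers, batch_size = 100, 4, 4
per_worker = num_samples // num_workers                   # 25 samples per worker
batches_per_worker = math.ceil(per_worker / batch_size)   # 7 batches, the last with only 1 sample
total_batches = batches_per_worker * num_workers          # 28 batches in total (last batch idx 27)
print(per_worker, batches_per_worker, total_batches)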

To Reproduce

Create Optimized Dataset
from litdata import optimize


def random_data(index):
    # Each sample is simply its own index (0..99).
    return index


if __name__ == "__main__":
    optimize(
        fn=random_data,
        inputs=list(range(100)),            # 100 samples in total
        output_dir="my_optimized_dataset",
        num_workers=4,
        chunk_bytes="64MB",
    )
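(Optional) To confirm the dataset was written, you can list the output directory; it should contain the serialized chunk files plus an index describing them (exact file names may vary with the litdata version):

import os

print(sorted(os.listdir("my_optimized_dataset")))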

Run the following script to reproduce the behaviour


from litdata import StreamingDataset, StreamingDataLoader

dataset = StreamingDataset("my_optimized_dataset")
dataloader = StreamingDataLoader(dataset, num_workers=4, batch_size=4)

for batch_idx, batch in enumerate(dataloader):
    if batch_idx == 0:
        print("Epoch", dataloader.current_epoch)
    print(batch, end=" ")

print("\n\n\nlast batch idx:", batch_idx)
print("len dataloader:", len(dataloader))
print("\n\nLen of dataset:", len(dataset))
print("samples streamed:", dataloader._num_samples_yielded_streaming)

Output (the issue only seems to occur when num_workers > 1):

Epoch 1
tensor([0, 1, 2, 3]) tensor([25, 26, 27, 28]) tensor([50, 51, 52, 53]) tensor([75, 76, 77, 78]) tensor([4, 5, 6, 7]) tensor([29, 30, 31, 32]) tensor([54, 55, 56, 57]) tensor([79, 80, 81, 82]) tensor([ 8,  9, 10, 11]) tensor([33, 34, 35, 36]) tensor([58, 59, 60, 61]) tensor([83, 84, 85, 86]) tensor([12, 13, 14, 15]) tensor([37, 38, 39, 40]) tensor([62, 63, 64, 65]) tensor([87, 88, 89, 90]) tensor([16, 17, 18, 19]) tensor([41, 42, 43, 44]) tensor([66, 67, 68, 69]) tensor([91, 92, 93, 94]) tensor([20, 21, 22, 23]) tensor([45, 46, 47, 48]) tensor([70, 71, 72, 73]) 
tensor([95, 96, 97, 98]) tensor([24]) tensor([49]) tensor([74]) tensor([99]) 


last batch idx: 27
len dataloader: 25


Len of dataset: 100
samples streamed: 112 

Expected behavior

The data loader should produce a consistent number of batches, each adhering to the specified batch size, with the total number of samples matching the available dataset size.
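For reference, the expected numbers for this setup (100 samples, batch_size=4) would be:

import math

expected_batches = math.ceil(100 / 4)   # 25 batches, which is what len(dataloader) reports
expected_samples = 100                  # every sample yielded exactly once per epoch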

Environment

  • PyTorch Version (e.g., 1.0): 2.4.0
  • OS (e.g., Linux): macOS
  • How you installed PyTorch (conda, pip, source): pip
  • Build command you used (if compiling from source):
  • Python version: 3.12.4
  • CUDA/cuDNN version:
  • GPU models and configuration:
  • Any other relevant information:

Additional context

bhimrazy avatar Aug 11 '24 14:08 bhimrazy

I would assume this is expected behaviour in a concurrent setting, and it only affects the last batches. You can drop the incomplete trailing batches by passing drop_last=True:

dataloader = StreamingDataLoader(
    dataset, num_workers=4, batch_size=4, drop_last=True
)
Epoch 1
tensor([0, 1, 2, 3]) tensor([24, 25, 26, 27]) tensor([48, 49, 50, 51]) tensor([72, 73, 74, 75]) tensor([4, 5, 6, 7]) tensor([28, 29, 30, 31]) tensor([52, 53, 54, 55]) tensor([76, 77, 78, 79]) tensor([ 8,  9, 10, 11]) tensor([32, 33, 34, 35]) tensor([56, 57, 58, 59]) tensor([80, 81, 82, 83]) tensor([12, 13, 14, 15]) tensor([36, 37, 38, 39]) tensor([60, 61, 62, 63]) tensor([84, 85, 86, 87]) tensor([16, 17, 18, 19]) tensor([40, 41, 42, 43]) tensor([64, 65, 66, 67]) tensor([88, 89, 90, 91]) tensor([20, 21, 22, 23]) tensor([44, 45, 46, 47]) tensor([68, 69, 70, 71]) tensor([92, 93, 94, 95]) 


last batch idx: 23
len dataloader: 24


Len of dataset: 96
samples streamed: 96
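If I read this right, with drop_last=True each of the 4 workers drops its incomplete 1-sample batch, so 4 × 24 = 96 samples remain, split into 24 full batches.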

gluonfield avatar Aug 14 '24 15:08 gluonfield

Thank you, @AugustDev, for the insights and solution. Closing this issue, as the behaviour seems to be expected in a multi-worker setting.

cc: @tchaton

bhimrazy avatar Sep 03 '24 18:09 bhimrazy