# 🐛 Bug: Dataloader Producing an Uneven Number of Batches and Streamed Items
## Description

When `num_workers > 1`, the data loader yields more batches than expected. In addition, the trailing batches contain fewer samples than the specified batch size, and the number of samples reported as streamed at the end of an epoch exceeds the number of samples actually available.
## To Reproduce

Create the optimized dataset:

```python
from litdata import optimize


def random_data(index):
    return index


if __name__ == "__main__":
    optimize(
        fn=random_data,
        inputs=list(range(100)),
        output_dir="my_optimized_dataset",
        num_workers=4,
        chunk_bytes="64MB",
    )
```
Then run the following script to reproduce the behaviour:

```python
from litdata import StreamingDataLoader, StreamingDataset

dataset = StreamingDataset("my_optimized_dataset")
dataloader = StreamingDataLoader(dataset, num_workers=4, batch_size=4)

for batch_idx, batch in enumerate(dataloader):
    if batch_idx == 0:
        print("Epoch", dataloader.current_epoch)
    print(batch, end=" ")

print("\n\n\nlast batch idx:", batch_idx)
print("len dataloader:", len(dataloader))
print("\n\nLen of dataset:", len(dataset))
print("samples streamed:", dataloader._num_samples_yielded_streaming)
```
Output (the issue seems to occur only when `num_workers > 1`):
```
Epoch 1
tensor([0, 1, 2, 3]) tensor([25, 26, 27, 28]) tensor([50, 51, 52, 53]) tensor([75, 76, 77, 78]) tensor([4, 5, 6, 7]) tensor([29, 30, 31, 32]) tensor([54, 55, 56, 57]) tensor([79, 80, 81, 82]) tensor([ 8, 9, 10, 11]) tensor([33, 34, 35, 36]) tensor([58, 59, 60, 61]) tensor([83, 84, 85, 86]) tensor([12, 13, 14, 15]) tensor([37, 38, 39, 40]) tensor([62, 63, 64, 65]) tensor([87, 88, 89, 90]) tensor([16, 17, 18, 19]) tensor([41, 42, 43, 44]) tensor([66, 67, 68, 69]) tensor([91, 92, 93, 94]) tensor([20, 21, 22, 23]) tensor([45, 46, 47, 48]) tensor([70, 71, 72, 73])
tensor([95, 96, 97, 98]) tensor([24]) tensor([49]) tensor([74]) tensor([99])

last batch idx: 27
len dataloader: 25

Len of dataset: 100
samples streamed: 112
```
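The numbers line up if each worker batches its own shard of the dataset independently: 100 samples split across 4 workers gives 25 samples per worker, i.e. 6 full batches plus a 1-sample trailing batch each. Below is a minimal arithmetic sketch of that assumption (plain Python, not litdata internals; reading the `samples streamed` counter as `batches * batch_size` is my guess):

```python
import math

num_samples, num_workers, batch_size = 100, 4, 4

# Assumption: each worker receives an equal shard of the dataset.
per_worker = num_samples // num_workers               # 25 samples per worker

# Without drop_last, each worker emits 6 full batches plus one 1-sample batch.
full, leftover = divmod(per_worker, batch_size)       # 6 full batches, 1 leftover sample
batches_per_worker = full + (1 if leftover else 0)    # 7 batches per worker

print(batches_per_worker * num_workers)               # 28 -> matches last batch idx 27
print(math.ceil(num_samples / batch_size))            # 25 -> len(dataloader), which assumes global batching
print(batches_per_worker * num_workers * batch_size)  # 112 -> matches the streamed counter
```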
## Expected behavior

The data loader should produce a consistent number of batches, each matching the specified batch size, and the total number of samples yielded should equal the dataset size.
## Environment

- PyTorch Version (e.g., 1.0): 2.4.0
- OS (e.g., Linux): macOS
- How you installed PyTorch (conda, pip, source): pip
- Build command you used (if compiling from source):
- Python version: 3.12.4
- CUDA/cuDNN version:
- GPU models and configuration:
- Any other relevant information:
## Additional context

I would assume this is expected behaviour in a concurrent setting, affecting only the last batches. You can drop the incomplete trailing batches with:

```python
dataloader = StreamingDataLoader(
    dataset, num_workers=4, batch_size=4, drop_last=True
)
```
```
Epoch 1
tensor([0, 1, 2, 3]) tensor([24, 25, 26, 27]) tensor([48, 49, 50, 51]) tensor([72, 73, 74, 75]) tensor([4, 5, 6, 7]) tensor([28, 29, 30, 31]) tensor([52, 53, 54, 55]) tensor([76, 77, 78, 79]) tensor([ 8, 9, 10, 11]) tensor([32, 33, 34, 35]) tensor([56, 57, 58, 59]) tensor([80, 81, 82, 83]) tensor([12, 13, 14, 15]) tensor([36, 37, 38, 39]) tensor([60, 61, 62, 63]) tensor([84, 85, 86, 87]) tensor([16, 17, 18, 19]) tensor([40, 41, 42, 43]) tensor([64, 65, 66, 67]) tensor([88, 89, 90, 91]) tensor([20, 21, 22, 23]) tensor([44, 45, 46, 47]) tensor([68, 69, 70, 71]) tensor([92, 93, 94, 95])

last batch idx: 23
len dataloader: 24

Len of dataset: 96
samples streamed: 96
```
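These numbers are consistent with each worker discarding its 1-sample trailing batch, using the same hypothetical per-worker arithmetic as above:

```python
num_samples, num_workers, batch_size = 100, 4, 4

per_worker = num_samples // num_workers        # 25 samples per worker shard
full_batches = per_worker // batch_size        # 6 full batches; the trailing batch is dropped

print(full_batches * num_workers)                 # 24 batches -> last batch idx 23
print(full_batches * num_workers * batch_size)    # 96 samples streamed
```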
Thank you, @AugustDev, for the insights and solution. Closing this issue, as the behaviour seems to be expected in a multi-worker setting.
cc: @tchaton