Training slows down as time progresses with litdata streaming dataset
🐛 Bug
Training slows down as time progresses.
To Reproduce
Unfortunately, I don't know of a good way to reproduce this; it happens with certain datasets.
The dataset in question is about 2TB in size on S3.
It is prepared with:

```python
from litdata import optimize

optimize(
    fn=processor,  # type: ignore
    inputs=inputs,
    output_dir=args.output_path,
    num_workers=args.num_workers,
    compression="zstd",    # The compression algorithm to use.
    chunk_bytes="128MB",   # The maximum number of bytes to write into a data chunk.
    reorder_files=False,
)
```
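For context (not part of the original report), here is a minimal sketch of how such an optimized dataset is typically consumed on the training side with litdata's StreamingDataset and StreamingDataLoader; the S3 path, batch size, and worker count are placeholders, not the values used in this run.

```python
# Minimal sketch (assumed, not from the report) of the training-side streaming setup.
from litdata import StreamingDataset, StreamingDataLoader

# Placeholder path: in practice this points at the optimize() output_dir on S3.
dataset = StreamingDataset("s3://my-bucket/optimized-dataset/", shuffle=True)
dataloader = StreamingDataLoader(dataset, batch_size=64, num_workers=8)

for batch in dataloader:
    ...  # forward/backward/optimizer step
```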
The training was run on SageMaker using a single A100 machine with DDP.
Here is one run: each epoch runs slower and slower.
I am also monitoring system metrics and can see CPU and GPU utilization dropping over time while memory usage remains very low.
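As a reference, a hypothetical way to quantify the slowdown (not taken from the report) is to log the wall-clock time of each epoch; `dataloader` here stands for whichever streaming loader the training loop uses, such as the one sketched above.

```python
import time

# Hypothetical timing harness: prints wall-clock time per epoch so the
# epoch-over-epoch slowdown is visible directly in the logs.
for epoch in range(10):
    start = time.perf_counter()
    for batch in dataloader:  # e.g. the litdata StreamingDataLoader from above
        pass  # training step would go here
    print(f"epoch {epoch}: {time.perf_counter() - start:.1f}s")
```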
Expected behavior
Training speed should remain consistent across all epochs.
Environment
- PyTorch Version (e.g., 1.0): 2.3.0
- OS (e.g., Linux): Ubuntu 22.04
- How you installed PyTorch (conda, pip, source): pip
- Build command you used (if compiling from source): N/A
- Python version: 3.11
- CUDA/cuDNN version: 12.4.1
- GPU models and configuration: A100