
Training slows down as time progresses with litdata streaming dataset

Open · ouj opened this issue 9 months ago · 3 comments

🐛 Bug

Training slows down as time progresses.

To Reproduce

Unfortunately, I don't know of a good way to reproduce this; it only happens with certain datasets.

The dataset in question is about 2TB in size on S3.

It is prepared with:

    from litdata import optimize

    optimize(
        fn=processor,  # type: ignore
        inputs=inputs,
        output_dir=args.output_path,
        num_workers=args.num_workers,
        compression="zstd",  # The compression algorithm to use.
        chunk_bytes="128MB",  # The maximum number of bytes to write into a data chunk.
        reorder_files=False,  # Keep the original input file order instead of sorting by size.
    )

The training was run on SageMaker on a single A100 machine with DDP.
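
For context, the training side consumes the optimized output through litdata's streaming classes. A minimal sketch of that setup, assuming the dataset is read directly from S3 (the bucket path, batch size, and worker count below are hypothetical, not from this report):

    from litdata import StreamingDataset, StreamingDataLoader

    # Hypothetical S3 path; the real dataset is ~2TB of zstd-compressed chunks.
    dataset = StreamingDataset(input_dir="s3://my-bucket/optimized-dataset", shuffle=True)
    dataloader = StreamingDataLoader(dataset, batch_size=32, num_workers=4)

    for batch in dataloader:
        ...  # training step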

Here is one run

[Plot: training_epoch metric, showing epoch duration increasing over the run]

We can see that each epoch runs slower and slower.
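
To quantify the slowdown, per-epoch wall-clock time can be logged with a small callback. A minimal sketch, assuming a Lightning Trainer is driving the DDP run (the report does not confirm the training framework):

    import time

    from lightning.pytorch.callbacks import Callback

    class EpochTimer(Callback):
        """Logs wall-clock seconds per training epoch."""

        def on_train_epoch_start(self, trainer, pl_module):
            self._start = time.monotonic()

        def on_train_epoch_end(self, trainer, pl_module):
            # sync_dist=True reduces the timing across DDP ranks.
            pl_module.log("epoch_seconds", time.monotonic() - self._start, sync_dist=True)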

I am also monitoring the system metrics and can see CPU and GPU utilization dropping over time while memory usage remains very low.

[Screenshot: system metrics dashboard, May 22 2024]

Expected behavior

Training speed should remain consistent across all epochs.

Environment

  • PyTorch Version (e.g., 1.0): 2.3.0
  • OS (e.g., Linux): Ubuntu 22.04
  • How you installed PyTorch (conda, pip, source): pip
  • Build command you used (if compiling from source): N/A
  • Python version: 3.11
  • CUDA/cuDNN version: 12.4.1
  • GPU models and configuration: A100

Additional context

ouj · May 22 '24 17:05