
Suboptimal usage of 8xH100 GPUs - Streaming dataloader speed significantly fluctuates across batches

VSehwag opened this issue 9 months ago · 7 comments

Setup

  • Environment: PyTorch 2.3.0, Composer 0.22.0, Streaming 0.7.4
  • GPU: 8x H100 SXM, BF16 mode

This issue is related to #643 but concerns a more subtle problem with Streaming datasets. Over the course of training, we observe that the streaming dataloader speed randomly takes a dip. This happens predominantly between epochs but also at random steps within an epoch. It is true for all model sizes we tested (from a single layer to a 1B-parameter model).

Our current setup includes:

  • Streaming dataset with a single stream; dataset size 0.2–2 TB
  • Data resides locally (`remote` set to `None`) in uncompressed format
  • `num_workers=4` and `prefetch_factor=4` (per GPU), `persistent_workers=True`, `pin_memory=True` (see the sketch after this list)
  • Global batch size 2048, microbatch size 256
  • FSDP with gradient_only sharding
  • Process launched with Composer
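
For reference, a minimal sketch of the dataset/dataloader construction described above (the local path and shuffle flag are illustrative; the remaining arguments match our setup):

```python
from streaming import StreamingDataset
from torch.utils.data import DataLoader

# Single local stream; remote=None since the shards already reside on local disk.
dataset = StreamingDataset(
    local="/data/mds/train",   # illustrative path
    remote=None,
    shuffle=True,              # illustrative; shuffling config omitted above
    batch_size=256,            # per-device microbatch
)

dataloader = DataLoader(
    dataset,
    batch_size=256,            # microbatch of 256 per GPU (global batch 2048 across 8 GPUs)
    num_workers=4,
    prefetch_factor=4,
    persistent_workers=True,
    pin_memory=True,
)
```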

Our overall setup is the same as the diffusion training from https://github.com/mosaicml/diffusion, i.e., we launch the script with `composer run.py --config-path yamls/hydra-yamls --config-name SD-2-base-256.yaml`.

The epoch size is 652 batches; at the epoch boundary the dataloader gets stuck and takes a long time to resume. However, the drop in throughput also occurs at random steps within epochs. This test was done on a relatively small dataset with 1.3M (652 x 2048) samples.

[Two throughput charts from 5/25/2024: the left is plotted against training steps, the right against wall-clock time]

We also tested two other datasets with disk sizes of 1 TB and 2 TB (and corresponding epoch sizes of 2k and 5k batches) and observe the same drops in throughput. The plot on the right shows that the dataloader often hangs for more than a minute (a simple way to measure these stalls is sketched after the list below). Two subtle issues are happening here:

  • With the 2x larger dataset, the training throughput is somehow slightly lower. The index.json files of the two datasets are 1.2M and 3.1M on disk. Both runs have identical setups.
  • Somehow the faster run with the 1 TB dataset gets stuck for longer (wider dips in the green curve).
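
To quantify the stalls, a simple diagnostic (hypothetical; not part of our training script) is to time each dataloader fetch and log the outliers:

```python
import time

# Hypothetical diagnostic: log any batch whose fetch time exceeds a threshold,
# to separate steady-state throughput from the minute-long stalls described above.
stall_threshold_s = 5.0
t_prev = time.perf_counter()
for step, batch in enumerate(dataloader):  # `dataloader` as in the sketch above
    fetch_time = time.perf_counter() - t_prev
    if fetch_time > stall_threshold_s:
        print(f"step {step}: dataloader stalled for {fetch_time:.1f}s")
    # ... forward/backward/optimizer step would go here ...
    t_prev = time.perf_counter()
```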

We have tried a prefetch factor of up to 8 (with 8 workers) and 2 (with 2 workers) for both datasets and did not see any resolution of the drops. We also ablated the FSDP mode to full_shard and no_shard, but the dips in throughput persist. We used a 140M-parameter model, but the dips are also there with a 1B model. We created the MDS dataset by processing our raw data with 8 processes and then merging the resulting index.json files.
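
For context, each of the 8 writer processes produced its own MDS subdirectory roughly as in the sketch below (the column schema and paths are illustrative), and the per-process index.json files were merged afterwards:

```python
from streaming import MDSWriter

# Illustrative column schema for the raw samples.
columns = {"image": "bytes", "caption": "str"}

def write_partition(rank: int, samples) -> None:
    """Write one process's slice of the raw data as an MDS shard directory."""
    out_dir = f"/data/mds/train/part_{rank}"  # hypothetical layout
    with MDSWriter(out=out_dir, columns=columns, compression=None) as writer:
        for sample in samples:  # each sample is a dict matching `columns`
            writer.write(sample)

# The per-partition index.json files are then merged into a single top-level index
# (e.g., with streaming's merge_index utility, if available in this version).
```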

The issue is somehow less severe in single-GPU training, where we only observe the drop in throughput at the end of the epoch (2k batches). All previous tests used 8 GPUs (with the grad-sharding FSDP config). Note that we don't observe a perfectly linear speedup (8.5 batches/s on 8 GPUs vs. 1.4 batches/s on 1 GPU), which indicates an I/O bound and could contribute to the worse dips in the multi-GPU setting.

Overall there are three puzzling questions:

  • It's unclear why streaming dataset performance suddenly drops throughout training. As noticed in #643, the drop most certainly happens at epoch boundaries (even though the data resides locally), but our larger-scale testing shows that it happens just as prominently within epochs too.
  • When the dataset size doubles (and thus index.json doubles in size) with everything else identical, why does the overall training throughput slightly decrease? And yet, with the faster throughput, the dataloader somehow gets stuck for longer wall-clock time.
  • Can the drops in throughput at the end of an epoch and within an epoch be handled with separate fixes? The former seems related to some configuration in streaming datasets (it happens even on a single GPU, where training is certainly not I/O bound). The dips within epochs on multi-GPU could be due to disk I/O being the bottleneck (though this is unclear).

VSehwag · May 25 '24 19:05