
[BUG] CPU Offloading Super Slow Even on a 60M Parameter Model?

Open · rileyhun opened this issue on Jun 08 '23 · 0 comments

Describe the bug I am doing LLM fine-tuning on the bigscience/t0-3b model, and after 8 hours the run was still stalling. Out of curiosity, I switched to t5-small and it stalled as well. For comparison, fine-tuning a 60M-parameter model for one epoch with DDP should take only a couple of minutes.

To run this training session, I am performing multi-node training on AWS Batch using 6 g5.4xlarge instances (one A10 GPU each).

Details (a minimal sketch of this configuration follows the list):

  • I am using DeepSpeedCPUAdam as the optimizer
  • Precision is bf16
  • Using ZeRO stage 3
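For reference, here is a minimal sketch of how this setup is typically wired up in PyTorch Lightning. This is not the reporter's exact code (see the linked repo below); the `T5FineTuner` class, learning rate, and batch layout are illustrative assumptions, and exact `Trainer` argument names can vary slightly across Lightning versions (e.g. `precision="bf16"` vs. `"bf16-mixed"`):

```python
# Minimal sketch of the described setup: ZeRO stage 3 + CPU offload + bf16.
# Hypothetical module name and hyperparameters; see the linked repo for the
# actual training code.
import pytorch_lightning as pl
from deepspeed.ops.adam import DeepSpeedCPUAdam
from pytorch_lightning.strategies import DeepSpeedStrategy
from transformers import T5ForConditionalGeneration


class T5FineTuner(pl.LightningModule):
    def __init__(self, model_name: str = "t5-small"):
        super().__init__()
        self.model = T5ForConditionalGeneration.from_pretrained(model_name)

    def training_step(self, batch, batch_idx):
        # batch is assumed to carry input_ids / attention_mask / labels
        return self.model(**batch).loss

    def configure_optimizers(self):
        # CPU offload keeps optimizer state (and the optimizer step) on the
        # host, so a CPU-capable optimizer is required here
        return DeepSpeedCPUAdam(self.parameters(), lr=1e-4)


trainer = pl.Trainer(
    accelerator="gpu",
    devices=1,      # one A10 per g5.4xlarge node
    num_nodes=6,
    precision="bf16",
    strategy=DeepSpeedStrategy(
        stage=3,                  # shard params, grads, and optimizer state
        offload_optimizer=True,   # optimizer state lives in host RAM
        offload_parameters=True,  # parameters paged to host RAM as well
    ),
)
# trainer.fit(T5FineTuner(), train_dataloaders=...)  # supply your DataLoader
```

With `offload_optimizer=True`, DeepSpeed requires a CPU-capable optimizer, which is why DeepSpeedCPUAdam is used here instead of torch.optim.AdamW.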

To Reproduce I have a reproducible example here: https://github.com/rileyhun/llm_finetuning_metaflow

I am using Metaflow with PyTorch Lightning to run a multi-node distributed training job on AWS Batch.
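Since there is no `deepspeed` launcher involved under AWS Batch, Lightning has to form the 6-node process group itself from standard rendezvous variables. A hypothetical sketch, assuming Lightning's default cluster-environment detection; the Metaflow decorator in the linked repo may set these itself, and the address/port values below are placeholders:

```python
# Hypothetical sketch of the rendezvous variables each of the six AWS Batch
# nodes would need to expose before Lightning can form the process group.
import os

os.environ.setdefault("MASTER_ADDR", "10.0.0.1")  # IP of the rank-0 node (placeholder)
os.environ.setdefault("MASTER_PORT", "29500")
os.environ.setdefault("WORLD_SIZE", "6")          # 6 nodes x 1 GPU each
os.environ.setdefault("NODE_RANK", "0")           # 0..5, unique per node
```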

Expected behavior I was expecting the training loop to complete one epoch in a reasonable amount of time.

ds_report output Please run ds_report to give us details about your setup.

Screenshots If applicable, add screenshots to help explain your problem.

System info (please complete the following information):

  • OS: Amazon Linux 2
  • GPU count and types: 6 A10 GPUs (one per g5.4xlarge node)
  • Interconnects (if applicable) [e.g., two machines connected with 100 Gbps IB]
  • Python version: 3.10.4
  • Any other relevant info about your setup

Launcher context Are you launching your experiment with the deepspeed launcher, MPI, or something else? Using PyTorch Lightning

Docker context Are you using a specific docker image that you can share?

Additional context Add any other context about the problem here.
