[BUG] CPU Offloading Super Slow Even on a 60M Parameter Model?
Describe the bug
I am doing LLM fine-tuning with the bigscience/t0-3b model, and after 8 hours the run was still stalling. Out of curiosity, I switched to t5-small and noticed it was stalling as well. For comparison, with plain DDP, fine-tuning a 60M-parameter model for one epoch should take only a couple of minutes.
The training runs as a multi-node job on AWS Batch across six g5.4xlarge instances (one A10 GPU each).
Details:
- I am using DeepSpeedCPUAdam as the optimizer
- Precision is bf16
- Using ZeRO stage 3 (a configuration sketch follows this list)
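For reference, here is a minimal sketch of how these three settings are expressed with PyTorch Lightning's DeepSpeedStrategy (the real code is in the repo linked under "To Reproduce"; the learning rate shown is just a placeholder):

```python
from deepspeed.ops.adam import DeepSpeedCPUAdam
from pytorch_lightning.strategies import DeepSpeedStrategy

# ZeRO stage 3 with optimizer states and parameters offloaded to CPU
strategy = DeepSpeedStrategy(
    stage=3,
    offload_optimizer=True,
    offload_parameters=True,
)

# Inside the LightningModule, the CPU-offload optimizer is returned from
# configure_optimizers (lr is a placeholder value):
#     def configure_optimizers(self):
#         return DeepSpeedCPUAdam(self.parameters(), lr=1e-4)
```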
To Reproduce
I have a reproducible example here: https://github.com/rileyhun/llm_finetuning_metaflow
I am using Metaflow with PyTorch Lightning to run a multi-node distributed training job on AWS Batch; a rough sketch of the Trainer setup is below.
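Roughly, the Trainer is configured like this (a simplified sketch; the actual launch goes through Metaflow and AWS Batch, and the model/datamodule objects come from the repo):

```python
import pytorch_lightning as pl
from pytorch_lightning.strategies import DeepSpeedStrategy

trainer = pl.Trainer(
    accelerator="gpu",
    devices=1,      # one A10 GPU per g5.4xlarge node
    num_nodes=6,    # six AWS Batch nodes
    precision="bf16",
    max_epochs=1,
    strategy=DeepSpeedStrategy(
        stage=3,
        offload_optimizer=True,
        offload_parameters=True,
    ),
)
# trainer.fit(model, datamodule=dm)  # model/datamodule are defined in the repo
```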
Expected behavior
I was expecting the training loop to complete one epoch in a reasonable amount of time.
ds_report output
Please run ds_report to give us details about your setup.
Screenshots
If applicable, add screenshots to help explain your problem.
System info (please complete the following information):
- OS: Amazon Linux 2
- GPU count and types: 6 A10 GPUs (one per g5.4xlarge node)
- Interconnects (if applicable):
- Python version: 3.10.4
- Any other relevant info about your setup
Launcher context
Are you launching your experiment with the deepspeed launcher, MPI, or something else?
Using PyTorch Lightning
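Lightning spawns the worker processes itself rather than going through the deepspeed launcher. As far as I understand, each AWS Batch node only needs the standard torch.distributed rendezvous variables set before the Trainer starts, roughly like below (the values are placeholders; Metaflow wires these up in the actual flow):

```python
import os

# Placeholder values; in the real job these come from Metaflow / AWS Batch.
os.environ["MASTER_ADDR"] = "10.0.0.1"  # address of the rank-0 node
os.environ["MASTER_PORT"] = "29500"
os.environ["WORLD_SIZE"] = "6"          # total processes (1 GPU per node x 6 nodes)
os.environ["NODE_RANK"] = "0"           # 0..5, unique per node
```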
Docker context
Are you using a specific docker image that you can share?
Additional context
Add any other context about the problem here.