
Training with DeepSpeed ZeRO-3

Open • TheBlackHacker opened this issue 1 year ago • 0 comments

Hello, I know a lot of people want to train the OpenChat model, but getting access to modern GPUs such as A100s or H100s can be difficult. So I tried using DeepSpeed ZeRO-3 to train on a cheaper GPU system: 8x A10G, each with 24 GB of VRAM.
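To see why ZeRO-3 makes 24 GB cards viable, here is a rough back-of-the-envelope sketch of per-GPU model-state memory. It uses the accounting from the ZeRO paper for mixed-precision Adam (2 bytes of 16-bit weights, 2 bytes of gradients and 12 bytes of fp32 optimizer state per parameter). The 7B parameter count is an assumption for illustration, not a figure from this PR, and activations, communication buffers and fragmentation are ignored.

```python
def zero3_model_state_bytes_per_gpu(num_params, world_size, offload_optimizer=False):
    """Approximate model-state bytes held on each GPU under ZeRO stage 3.

    Ignores activations, temporary buffers and memory fragmentation.
    """
    weights_16bit = 2 * num_params      # fp16/bf16 model weights
    grads_16bit = 2 * num_params        # fp16/bf16 gradients
    optimizer_states = 12 * num_params  # fp32 master weights + Adam momentum + variance
    on_gpu = weights_16bit + grads_16bit
    if not offload_optimizer:
        on_gpu += optimizer_states      # kept on GPU unless offloaded to CPU/NVMe
    # Stage 3 partitions weights, gradients and optimizer states across all ranks.
    return on_gpu / world_size


if __name__ == "__main__":
    n = 7_000_000_000  # hypothetical 7B-parameter model
    for offload in (False, True):
        gb = zero3_model_state_bytes_per_gpu(n, world_size=8, offload_optimizer=offload) / 1e9
        print(f"offload_optimizer={offload}: {gb:.1f} GB per GPU")
```

For comparison, plain data parallelism would replicate all 16 bytes per parameter on every GPU (about 112 GB for 7B parameters), which is why sharding plus CPU offload is what brings the model states within reach of a 24 GB card.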

What I added in this pull request:

ochat/training_deepspeed/deepspeed_config_zero3.json

- Changed the ZeRO strategy from ZeRO-2 to ZeRO-3.
- Enabled offloading to CPU (to offload to NVMe instead, edit the config file).
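As a minimal sketch of what such a config can look like, the function below builds a ZeRO-3 DeepSpeed config with CPU (or NVMe) offload and prints it as JSON. This is an illustration under assumptions, not the actual contents of deepspeed_config_zero3.json: the helper name make_zero3_config, the batch-size values, bf16 and the example NVMe path are all hypothetical choices; the key names are standard DeepSpeed config fields.

```python
import json


def make_zero3_config(offload_device="cpu", nvme_path=None,
                      micro_batch_size=1, grad_accum_steps=1):
    """Hypothetical helper: build a ZeRO-3 config dict with offloading."""
    if offload_device not in ("cpu", "nvme"):
        raise ValueError("offload_device must be 'cpu' or 'nvme'")
    offload = {"device": offload_device, "pin_memory": True}
    if offload_device == "nvme":
        if nvme_path is None:
            raise ValueError("nvme_path is required when offloading to NVMe")
        offload["nvme_path"] = nvme_path  # e.g. a local NVMe mount point
    return {
        "train_micro_batch_size_per_gpu": micro_batch_size,
        "gradient_accumulation_steps": grad_accum_steps,
        "bf16": {"enabled": True},  # assumption: A10G (Ampere) supports bf16
        "zero_optimization": {
            "stage": 3,
            "offload_optimizer": dict(offload),  # optimizer states off the GPU
            "offload_param": dict(offload),      # parameters off the GPU
            "overlap_comm": True,
            "contiguous_gradients": True,
            # Needed so a 16-bit model can be gathered and saved under ZeRO-3.
            "stage3_gather_16bit_weights_on_model_save": True,
        },
    }


if __name__ == "__main__":
    print(json.dumps(make_zero3_config(), indent=2))
```

Switching to NVMe would then be a call like make_zero3_config("nvme", nvme_path="/local_nvme"), where the path is only an example.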

ochat/training_deepspeed/train_zero3.py

- Changed the optimizer from torch.optim.AdamW to deepspeed.ops.adam.DeepSpeedCPUAdam.
- Saved the model in 16-bit precision on all ranks.
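The two training-script changes might look roughly like the sketch below. It is not the PR's train_zero3.py: the helper names build_engine and save_16bit and the default learning rate are hypothetical. DeepSpeedCPUAdam runs the optimizer step on the CPU, which is what an offloaded optimizer needs, and with its default adamw_mode=True it applies decoupled weight decay like torch.optim.AdamW. Saving goes through the engine's save_16bit_model, which under ZeRO-3 gathers the sharded weights collectively, so every rank has to call it.

```python
def build_engine(model, ds_config, lr=2e-5, weight_decay=0.0):
    """Hypothetical helper: wrap a model in a DeepSpeed engine with CPU Adam."""
    # Imported lazily so this module can be loaded without DeepSpeed installed.
    import deepspeed
    from deepspeed.ops.adam import DeepSpeedCPUAdam

    # The optimizer step runs on the CPU, matching an offloaded optimizer.
    optimizer = DeepSpeedCPUAdam(model.parameters(), lr=lr, weight_decay=weight_decay)
    engine, _, _, _ = deepspeed.initialize(model=model, optimizer=optimizer, config=ds_config)
    return engine


def save_16bit(engine, save_dir):
    """Save 16-bit weights; call this on EVERY rank, not only rank 0."""
    # Under ZeRO-3 each rank holds only a shard, so gathering is a collective op.
    # Requires "stage3_gather_16bit_weights_on_model_save": true in the config.
    return engine.save_16bit_model(save_dir)
```

Guarding the save with an `if rank == 0` check, as is common for ZeRO-2 or plain data parallelism, would leave the other ranks out of the gather, so the call is made unconditionally here.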

README.md

- Added instructions for training with ZeRO-3.

TheBlackHacker • Nov 25 '23 08:11