[BUG] Loading the model takes a very long time with zero_optimization stage=3
Describe the bug

When I use ZeRO optimization (stage=3), loading the model takes a very long time. I'm trying to finetune OPT-66B on 2 nodes, each with 8x NVIDIA A100-SXM (80GB) and 1TB of RAM. I have already finished training OPT-30B with ZeRO optimization (stage=2). However, according to the DeepSpeed docs, that kind of ZeRO optimization costs (node_num × memory of one model) of RAM:
These modifications allow for models that exceed the size of local CPU/GPU memory/NVMe, but fit within the total NVMe capacity (i.e., aggregate CPU or GPU memory or NVMe) across all nodes. Consider initializing a model with one trillion parameters, whose weights occupy two terabytes (TB) in half precision. The initial CPU allocation in full precision requires 4TB of memory per process, and so a system with 8 GPUs per node would need 32TB of CPU memory due to data-parallel redundancies. Instead, by immediately partitioning tensors we remove the redundancies. The result is that regardless of the number of GPUs, we still only require the original 4TB. This allows for a linear increase in model size with the aggregate system memory. For example, if a node has 1TB of memory and 8 GPUs, we could fit a trillion parameter model with 4 nodes and 32 GPUs.
And the output of htop during the OPT-30B run shows the same result.
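As a rough sanity check, here is a back-of-the-envelope sketch (my own arithmetic, not DeepSpeed code; it follows the full-precision 4-bytes-per-parameter initial CPU allocation described in the quoted doc and ignores activations and other overhead):

```python
# Rough RAM estimate for non-partitioned model initialization, following the
# doc's assumption above: each of the 8 processes on a node makes an initial
# full-precision CPU allocation of 4 bytes per parameter.
BYTES_PER_PARAM = 4      # full-precision initial allocation (per the doc quote)
PROCS_PER_NODE = 8       # one process per GPU
NODE_RAM_GB = 1024       # 1TB of RAM per node

for name, n_params in [("OPT-30B", 30e9), ("OPT-66B", 66e9)]:
    per_proc_gb = n_params * BYTES_PER_PARAM / 1024**3
    per_node_gb = per_proc_gb * PROCS_PER_NODE
    verdict = "fits" if per_node_gb < NODE_RAM_GB else "does NOT fit"
    print(f"{name}: ~{per_proc_gb:.0f} GB per process, "
          f"~{per_node_gb:.0f} GB per node -> {verdict} in {NODE_RAM_GB} GB")

# OPT-30B: ~112 GB per process, ~894 GB per node -> fits (barely, matching htop)
# OPT-66B: ~246 GB per process, ~1967 GB per node -> does NOT fit
```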
Since 1TB of RAM is not enough for OPT-66B (with node_num=8), I use ZeRO optimization (stage=3) instead, but then loading is extremely slow.
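For reference, the "immediately partitioning tensors" behaviour from the quoted doc corresponds to constructing the model inside deepspeed.zero.Init (as far as I understand, the Hugging Face Trainer does this automatically when a stage-3 config is passed). A minimal sketch, meant to run under the deepspeed launcher and not taken from my finetune.py:

```python
# Minimal sketch of ZeRO stage-3 partitioned model construction (illustration
# only, not my actual finetune.py; run under the deepspeed launcher so that
# the distributed environment is set up).
import torch
import deepspeed

with deepspeed.zero.Init(config_dict_or_path="ds_config.json"):
    # Placeholder module; in the real run this is OPT-66B built by
    # transformers' from_pretrained via the Trainer's DeepSpeed integration.
    # Parameters are sharded across ranks as each submodule is constructed,
    # so no single rank holds the full model in CPU RAM.
    model = torch.nn.Sequential(
        torch.nn.Linear(8192, 8192),
        torch.nn.Linear(8192, 8192),
    )
```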
After the run reaches the log output below, loading is estimated to need about 2000 minutes (extrapolating from the 350M model, which took 10 minutes at this stage).
```
10.28.0.57: [INFO|integrations.py:579] 2022-08-08 22:24:18,989 >> Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
10.28.0.57: wandb: Currently logged in as: zchill (use `wandb login --relogin` to force relogin)
10.28.0.57: wandb: wandb version 0.13.1 is available! To upgrade, please run:
10.28.0.57: wandb: $ pip install wandb --upgrade
10.28.0.57: wandb: Tracking run with wandb version 0.12.10
10.28.0.57: wandb: Syncing run ***********************************
10.28.0.57: wandb: ⭐️ View project at ****************************
10.28.0.57: wandb: 🚀 View run at ***********************************
10.28.0.57: wandb: Run data is saved locally in ***********************************
10.28.0.57: wandb: Run `wandb offline` to turn off syncing.
```
To Reproduce
ds_config.json:
{ "fp16": { "enabled": "auto", "loss_scale": 0, "loss_scale_window": 1000, "initial_scale_power": 16, "hysteresis": 2, "min_loss_scale": 1 }, "optimizer": { "type": "AdamW", "params": { "lr": "auto", "betas": "auto", "eps": "auto", "weight_decay": "auto" } }, "scheduler": { "type": "WarmupLR", "params": { "warmup_min_lr": "auto", "warmup_max_lr": "auto", "warmup_num_steps": "auto" } } , "zero_optimization": { "stage": 3 }, "gradient_accumulation_steps": "auto", "gradient_clipping": "auto", "steps_per_print": 2000, "train_batch_size": "auto", "train_micro_batch_size_per_gpu": "auto", "wall_clock_breakdown": false }
Expected behavior
I wonder whether this loading stage can be made faster, and whether the config I used has some mistake.
ds_report output

```
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
 [WARNING] please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
 [WARNING] async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING] async_io: please install the libaio-devel package with yum
 [WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/xihu/anaconda3/envs/tk-instruct/lib/python3.8/site-packages/torch']
torch version .................... 1.10.0+cu113
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.3
deepspeed install path ........... ['/home/xihu/anaconda3/envs/tk-instruct/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.6.5, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.12, cuda 11.3
```
System info (please complete the following information):
- OS: CentOS Linux release 8.2.2004 (Core)
- Two machines with 8x A100 (80GB) each
- Interconnects: the two machines are connected with InfiniBand
- Python version: 3.8.13
Launcher context
CMD="deepspeed --num_nodes 2 --hostfile hostfile --num_gpus 8 --master_port 4586 --master_addr 10.28.0.57 finetune.py ${OPTS}"