DeepSpeed
[BUG] exit -9 while training with CodeGen-6B
Describe the bug
Hi, I am new to DeepSpeed. I am training an RL setup with an actor-critic model (CodeGen-6B, shared parameters) and a reference model (CodeGen-6B). However, the run
exits with return code = -9
while loading the actor model on 8 × A100 (80 GB) GPUs, which seems like it should not happen.
Here is my back-of-the-envelope calculation: parameters for the two models: 6B × 2 × 4 bytes = 48 GB; gradients: 48 GB; AdamW optimizer states (first- and second-order moments): 48 GB × 2 = 96 GB. That is roughly 192 GB in total, which should be affordable for training even in fp32.
My first question: why do I get a CUDA OOM even with 4 × A100 (80 GB) GPUs? Is there anything I missed in the calculation? My second question: why do I (apparently) hit a CPU OOM even with 8 × 32 GB of CPU RAM and 8 × A100 (80 GB) GPUs?
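For reference, here is the same estimate as a tiny script (my rough numbers only; fp32 everywhere, and activations, temporary buffers, and framework overhead are ignored):

# Back-of-the-envelope memory estimate for two CodeGen-6B models in fp32.
# Illustrative only: activations, buffers, and framework overhead are ignored.
N_PARAMS = 6e9      # parameters per model
N_MODELS = 2        # actor-critic + reference
BYTES_FP32 = 4

params_gb = N_PARAMS * N_MODELS * BYTES_FP32 / 1e9   # 48 GB of weights
grads_gb = 48                                        # 48 GB of gradients
adam_gb = 48 * 2                                     # 96 GB for AdamW first/second moments

print(params_gb + grads_gb + adam_gb)                # ~192 GB in total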
To Reproduce
- Initialize the two models for RL like this (actor-critic & reference model):
import torch.nn as nn
import transformers

class CodeGenModel(nn.Module):
    def __init__(self, model_args):
        super().__init__()
        # CodeGen-6B backbone
        self.model = transformers.AutoModelForCausalLM.from_pretrained(
            'Salesforce/codegen-6B-mono',
            cache_dir='cache',
        )
        self.first_dropout = nn.Dropout(0.1)
        # value head for the critic (shares the backbone with the actor)
        self.summary = nn.Linear(self.model.config.n_embd, 1)

# model_args comes from the training script's argument parser
model = CodeGenModel(model_args).cuda()
ref_model = CodeGenModel(model_args).cuda()
- Edit my ds_config (a sketch of how this config is hooked up to deepspeed.initialize follows the JSON):
{
    "bf16": {
        "enabled": true
    },
    "zero_optimization": {
        "stage": 1,
        "contiguous_gradients": true,
        "overlap_comm": true,
        "allgather_partitions": true,
        "reduce_scatter": true,
        "allgather_bucket_size": 200000000,
        "reduce_bucket_size": 200000000,
        "sub_group_size": 1000000000000,
        "offload_optimizer": {
            "device": "none",
            "pin_memory": false
        },
        "offload_param": {
            "device": "none",
            "pin_memory": false
        }
    },
    "activation_checkpointing": {
        "partition_activations": true,
        "cpu_checkpointing": false,
        "contiguous_memory_optimization": false,
        "synchronize_checkpoint_boundary": false
    },
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 32,
    "gradient_clipping": 1.0,
    "steps_per_print": 8,
    "wall_clock_breakdown": false
}
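For context, the engine is created from this config roughly as follows (a minimal sketch with illustrative names; the actual script simply passes the same file via --deepspeed as shown in the command below):

import deepspeed
import torch

# Sketch only: wrap the trainable actor-critic model in a DeepSpeed engine
# using ds_config.json. The reference model is inference-only and is not wrapped.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    config='ds_config.json',
)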
- Run my training command:
deepspeed --num_gpus 8 --num_nodes 1 rl.py \
    --run 1 \
    --model_max_length 512 \
    --asp 5 \
    --ns 10 \
    --data_path data/APPS/ \
    --model_name_or_path codegen-6B-mono \
    --output_dir output/codegen-6B \
    --train_batch_size 32 \
    --test_batch_size 48 \
    --lr 1e-6 \
    --kl_coef 0.1 \
    --kl_target 1 \
    --report_to none \
    --deepspeed ds_config.json \
    --skip_memory_metrics 0 \
    --vf_coef 1e-3
- Logging information:
[2023-05-05 21:34:54,714] [WARNING] [runner.py:190:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
Detected CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 but ignoring it because one or several of --include/--exclude/--num_gpus/--num_nodes cl args were used. If you want to use CUDA_VISIBLE_DEVICES don't pass any of these arguments to deepspeed.
[2023-05-05 21:34:54,784] [INFO] [runner.py:540:main] cmd = /home/miniconda3/envs/RL/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None rl.py --run 1 --model_max_length 512 --asp 5 --ns 10 --data_path data/APPS/ --model_name_or_path codegen-6B-mono --output_dir output/codegen-6B --train_batch_size 32 --test_batch_size 48 --lr 1e-6 --kl_coef 0.1 --kl_target 1 --report_to none --deepspeed ds_config.json --skip_memory_metrics 0 --vf_coef 1e-3
[2023-05-05 21:35:01,406] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2023-05-05 21:35:01,406] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=8, node_rank=0
[2023-05-05 21:35:01,406] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2023-05-05 21:35:01,406] [INFO] [launch.py:247:main] dist_world_size=8
[2023-05-05 21:35:01,406] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[2023-05-05 21:35:07,657] [INFO] [comm.py:586:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-05-05 21:36:30,568] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 231119
[2023-05-05 21:36:32,094] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 231120
[2023-05-05 21:36:34,507] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 231121
[2023-05-05 21:36:34,507] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 231122
[2023-05-05 21:36:36,271] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 231123
[2023-05-05 21:36:38,151] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 231124
[2023-05-05 21:36:39,219] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 231125
[2023-05-05 21:36:40,354] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 231126
[2023-05-05 21:36:41,648] [ERROR] [launch.py:434:sigkill_handler] ['/home/miniconda3/envs/RL/bin/python', '-u', 'rl.py', '--local_rank=7', '--run', '1', '--model_max_length', '512', '--asp', '5', '--ns', '10', '--data_path', 'data/APPS/', '--model_name_or_path', 'codegen-6B-mono', '--output_dir', 'output/codegen-6B', '--train_batch_size', '32', '--test_batch_size', '48', '--lr', '1e-6', '--kl_coef', '0.1', '--kl_target', '1', '--report_to', 'none', '--deepspeed', 'ds_config.json', '--skip_memory_metrics', '0', '--vf_coef', '1e-3'] exits with return code = -9
error: Detected 1 oom-kill event(s) in step 32928.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
Expected behavior
Specifically, I requested 8 × 32 GB of CPU memory on my cluster. When I use free, I get:
              total        used        free      shared  buff/cache   available
Mem:            251         137          45           0          68         112
Swap:             7           7           0
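To check that it is host memory (not GPU memory) that runs out, I also log the per-rank resident set size around model loading with something like this (a quick sketch using psutil; it is not part of my training script):

import os
import psutil

def log_host_mem(tag):
    # resident set size of this rank's process, in GB
    rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1e9
    rank = os.environ.get('LOCAL_RANK', '?')
    print(f'[rank {rank}] {tag}: rss={rss_gb:.1f} GB', flush=True)

log_host_mem('before from_pretrained')
# ... build CodeGenModel / load the checkpoints ...
log_host_mem('after from_pretrained')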
ds_report output
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/miniconda3/envs/RL/lib/python3.9/site-packages/torch']
torch version .................... 1.13.0+cu116
deepspeed install path ........... ['/home/miniconda3/envs/RL/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.9.1, unknown, unknown
torch cuda version ............... 11.6
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.6
System info:
- OS: Ubuntu 20.04.5 LTS
- GPU: one machine with 8 × A100 (80 GB) GPUs
- Python: 3.9.16
- transformers: 4.26.1, torch 1.13.0+cu116, deepspeed 0.9.1
Launcher context Launched with the deepspeed launcher; see the To Reproduce section above.
Docker context I did not use docker.
Additional context What I have tried:
- Smaller batch sizes (2, 8, 16): same problem, because it happens while loading the model.
- Changing ds_config to offload optimizer states and parameters to CPU (same problem):
"offload_optimizer": {
"device": "cpu",
"pin_memory": false
},
"offload_param": {
"device": "cpu",
"pin_memory": false
}
- A smaller model (codegen-350M): works ✅