DeepSpeed
[BUG] exit -9 while training with CodeGen-6B
Describe the bug
Hi, I am new to DeepSpeed. I am training an RL setup with an actor-critic model (CodeGen-6B, shared parameters) and a reference model (CodeGen-6B). However, the run
exits with return code = -9
while loading the actor model on 8 × A100 (80 GB) GPUs, which seems like it should not happen.
Here is my back-of-the-envelope calculation: parameters for the two models: 6B × 2 × 4 bytes = 48 GB; gradients: 48 GB; AdamW optimizer states (first- and second-order moments): 48 GB × 2 = 96 GB. That is roughly 192 GB in total, which should be affordable for training even in fp32.
My first question: why do I get a CUDA OOM even with 4 × A100 (80 GB) GPUs? Is there anything I missed in the calculation? My second question: why do I (apparently) hit a CPU OOM even with 8 × 32 GB of CPU RAM and 8 × A100 (80 GB) GPUs?
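For reference, here is the same estimate as a tiny script (my rough numbers only; fp32 everywhere, and activations, temporary buffers, and framework overhead are ignored):

# Back-of-the-envelope memory estimate for two CodeGen-6B models in fp32.
# Illustrative only: activations, buffers, and framework overhead are ignored.
N_PARAMS = 6e9      # parameters per model
N_MODELS = 2        # actor-critic + reference
BYTES_FP32 = 4

params_gb = N_PARAMS * N_MODELS * BYTES_FP32 / 1e9   # 48 GB of weights
grads_gb = 48                                        # 48 GB of gradients
adam_gb = 48 * 2                                     # 96 GB for AdamW first/second moments

print(params_gb + grads_gb + adam_gb)                # ~192 GB in total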
To Reproduce
- Initialize the two models for RL like this (actor-critic & reference model):
import torch.nn as nn
import transformers

class CodeGenModel(nn.Module):
    def __init__(self, model_args):
        super().__init__()
        # CodeGen-6B backbone
        self.model = transformers.AutoModelForCausalLM.from_pretrained(
            'Salesforce/codegen-6B-mono',
            cache_dir='cache',
        )
        self.first_dropout = nn.Dropout(0.1)
        # value head for the critic (shares the backbone with the actor)
        self.summary = nn.Linear(self.model.config.n_embd, 1)

# model_args comes from the training script's argument parser
model = CodeGenModel(model_args).cuda()
ref_model = CodeGenModel(model_args).cuda()
- Edit my ds_config (a sketch of how this config is hooked up to deepspeed.initialize follows the JSON):
{
    "bf16": {
        "enabled": true
    },
    "zero_optimization": {
        "stage": 1,
        "contiguous_gradients": true,
        "overlap_comm": true,
        "allgather_partitions": true,
        "reduce_scatter": true,
        "allgather_bucket_size": 200000000,
        "reduce_bucket_size": 200000000,
        "sub_group_size": 1000000000000,
        "offload_optimizer": {
            "device": "none",
            "pin_memory": false
        },
        "offload_param": {
            "device": "none",
            "pin_memory": false
        }
    },
    "activation_checkpointing": {
        "partition_activations": true,
        "cpu_checkpointing": false,
        "contiguous_memory_optimization": false,
        "synchronize_checkpoint_boundary": false
    },
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 32,
    "gradient_clipping": 1.0,
    "steps_per_print": 8,
    "wall_clock_breakdown": false
}
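For context, the engine is created from this config roughly as follows (a minimal sketch with illustrative names; the actual script simply passes the same file via --deepspeed as shown in the command below):

import deepspeed
import torch

# Sketch only: wrap the trainable actor-critic model in a DeepSpeed engine
# using ds_config.json. The reference model is inference-only and is not wrapped.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    config='ds_config.json',
)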
- Run my training command:
deepspeed --num_gpus 8 --num_nodes 1 rl.py \
    --run 1 \
    --model_max_length 512 \
    --asp 5 \
    --ns 10 \
    --data_path data/APPS/ \
    --model_name_or_path codegen-6B-mono \
    --output_dir output/codegen-6B \
    --train_batch_size 32 \
    --test_batch_size 48 \
    --lr 1e-6 \
    --kl_coef 0.1 \
    --kl_target 1 \
    --report_to none \
    --deepspeed ds_config.json \
    --skip_memory_metrics 0 \
    --vf_coef 1e-3
- Logging information:
[2023-05-05 21:34:54,714] [WARNING] [runner.py:190:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
Detected CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 but ignoring it because one or several of --include/--exclude/--num_gpus/--num_nodes cl args were used. If you want to use CUDA_VISIBLE_DEVICES don't pass any of these arguments to deepspeed.
[2023-05-05 21:34:54,784] [INFO] [runner.py:540:main] cmd = /home/miniconda3/envs/RL/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None rl.py --run 1 --model_max_length 512 --asp 5 --ns 10 --data_path data/APPS/ --model_name_or_path codegen-6B-mono --output_dir output/codegen-6B --train_batch_size 32 --test_batch_size 48 --lr 1e-6 --kl_coef 0.1 --kl_target 1 --report_to none --deepspeed ds_config.json --skip_memory_metrics 0 --vf_coef 1e-3
[2023-05-05 21:35:01,406] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2023-05-05 21:35:01,406] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=8, node_rank=0
[2023-05-05 21:35:01,406] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2023-05-05 21:35:01,406] [INFO] [launch.py:247:main] dist_world_size=8
[2023-05-05 21:35:01,406] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[2023-05-05 21:35:07,657] [INFO] [comm.py:586:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-05-05 21:36:30,568] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 231119
[2023-05-05 21:36:32,094] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 231120
[2023-05-05 21:36:34,507] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 231121
[2023-05-05 21:36:34,507] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 231122
[2023-05-05 21:36:36,271] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 231123
[2023-05-05 21:36:38,151] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 231124
[2023-05-05 21:36:39,219] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 231125
[2023-05-05 21:36:40,354] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 231126
[2023-05-05 21:36:41,648] [ERROR] [launch.py:434:sigkill_handler] ['/home/miniconda3/envs/RL/bin/python', '-u', 'rl.py', '--local_rank=7', '--run', '1', '--model_max_length', '512', '--asp', '5', '--ns', '10', '--data_path', 'data/APPS/', '--model_name_or_path', 'codegen-6B-mono', '--output_dir', 'output/codegen-6B', '--train_batch_size', '32', '--test_batch_size', '48', '--lr', '1e-6', '--kl_coef', '0.1', '--kl_target', '1', '--report_to', 'none', '--deepspeed', 'ds_config.json', '--skip_memory_metrics', '0', '--vf_coef', '1e-3'] exits with return code = -9
error: Detected 1 oom-kill event(s) in step 32928.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
Expected behavior
Specifically, I requested 8 × 32 GB of CPU memory on my cluster. When I use free, I get:
              total        used        free      shared  buff/cache   available
Mem:            251         137          45           0          68         112
Swap:             7           7           0
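To check that it is host memory (not GPU memory) that runs out, I also log the per-rank resident set size around model loading with something like this (a quick sketch using psutil; it is not part of my training script):

import os
import psutil

def log_host_mem(tag):
    # resident set size of this rank's process, in GB
    rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1e9
    rank = os.environ.get('LOCAL_RANK', '?')
    print(f'[rank {rank}] {tag}: rss={rss_gb:.1f} GB', flush=True)

log_host_mem('before from_pretrained')
# ... build CodeGenModel / load the checkpoints ...
log_host_mem('after from_pretrained')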
ds_report output
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/miniconda3/envs/RL/lib/python3.9/site-packages/torch']
torch version .................... 1.13.0+cu116
deepspeed install path ........... ['/home/miniconda3/envs/RL/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.9.1, unknown, unknown
torch cuda version ............... 11.6
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.6
System info:
- OS: Ubuntu 20.04.5 LTS
- GPU: one machine with 8 × A100 (80 GB) GPUs
- Python: 3.9.16
- transformers: 4.26.1, torch 1.13.0+cu116, deepspeed 0.9.1
Launcher context Launched with the deepspeed launcher; see the To Reproduce section above.
Docker context I did not use docker.
Additional context What I have tried:
- Smaller batch sizes (2, 8, 16): same problem, because it happens while loading the model.
- Changing ds_config to offload optimizer states and parameters to CPU (same problem):
"offload_optimizer": {
"device": "cpu",
"pin_memory": false
},
"offload_param": {
"device": "cpu",
"pin_memory": false
}
- A smaller model (codegen-350M): works ✅