
[BUG] "exits with return code = -9" while creating the fp16 ZeRO stage 2 optimizer

Open · sxthunder opened this issue on Feb 18, 2023 · 3 comments

I am training a 10B model with DeepSpeed and Megatron on A100 GPUs (80 GB). Here is my ds_report:

[screenshot of ds_report output]

With 4 GPUs, the error is a CUDA out-of-memory inside deepspeed.initialize. The traceback is:

Traceback (most recent call last):
  File "/workspace/bin/codegeex/megatron/tools/pretrain_codegeex.py", line 207, in <module>
    pretrain(
  File "/workspace/bin/codegeex/megatron/training.py", line 152, in pretrain
    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
  File "/workspace/bin/codegeex/megatron/training.py", line 419, in setup_model_and_optimizer
    model, optimizer, _, lr_scheduler = deepspeed.initialize(
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/__init__.py", line 119, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 294, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1082, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1334, in _configure_zero_optimizer
    optimizer = DeepSpeedZeroOptimizer(
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 503, in __init__
    self.initialize_optimizer_states()
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 586, in initialize_optimizer_states
    self.optimizer.step()
  File "/opt/conda/lib/python3.8/site-packages/torch/optim/optimizer.py", line 113, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/apex/optimizers/fused_adam.py", line 149, in step
    state['exp_avg_sq'] = torch.zeros_like(p.data)
RuntimeError: CUDA out of memory. Tried to allocate 11.99 GiB (GPU 0; 79.35 GiB total capacity; 59.95 GiB already allocated; 8.52 GiB free; 59.96 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

When I use 8 GPUs, the failure happens earlier, during "Creating fp16 ZeRO stage 2 optimizer". The log is:

[2023-02-18 11:14:26,616] [INFO] [logging.py:69:log_dist] [Rank 0] Creating fp16 ZeRO stage 2 optimizer
[2023-02-18 11:14:26,616] [INFO] [stage_1_and_2.py:132:__init__] Reduce bucket size 50000000
[2023-02-18 11:14:26,616] [INFO] [stage_1_and_2.py:133:__init__] Allgather bucket size 50000000
[2023-02-18 11:14:26,616] [INFO] [stage_1_and_2.py:134:__init__] CPU Offload: False
[2023-02-18 11:14:26,616] [INFO] [stage_1_and_2.py:135:__init__] Round robin gradient partitioning: False
Using /root/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
Creating extension directory /root/.cache/torch_extensions/py38_cu113/utils...
Using /root/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/py38_cu113/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] c++ -MMD -MF flatten_unflatten.o.d -DTORCH_EXTENSION_NAME=utils -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /opt/conda/lib/python3.8/site-packages/torch/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/TH -isystem /opt/conda/lib/python3.8/site-packages/torch/include/THC -isystem /opt/conda/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -c /opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/utils/flatten_unflatten.cpp -o flatten_unflatten.o
[2/2] c++ flatten_unflatten.o -shared -L/opt/conda/lib/python3.8/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o utils.so
Loading extension module utils...
Time to load utils op: 17.926154613494873 seconds
Loading extension module utils...
Time to load utils op: 17.922656297683716 seconds
Loading extension module utils...
Time to load utils op: 17.92318892478943 seconds
Loading extension module utils...
Time to load utils op: 17.922754287719727 seconds
Loading extension module utils...
Loading extension module utils...
Time to load utils op: 17.922405004501343 seconds
Time to load utils op: 17.922232389450073 seconds
Loading extension module utils...
Time to load utils op: 17.92255711555481 seconds
Loading extension module utils...
Time to load utils op: 17.92217516899109 seconds
[2023-02-18 11:14:54,208] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 707
[2023-02-18 11:14:54,208] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 708
[2023-02-18 11:14:54,209] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 709
[2023-02-18 11:14:54,210] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 710
[2023-02-18 11:14:54,210] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 711
[2023-02-18 11:14:54,210] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 712
[2023-02-18 11:14:54,211] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 713
[2023-02-18 11:14:54,215] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 714

Then the process exits with return code = -9.

Why does the error happen earlier when I increase the number of GPUs from 4 to 8? The second error did not occur when I used 4 GPUs.
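
For rough context on the 4-GPU failure, here is a back-of-envelope sketch of per-GPU memory under ZeRO stage 2 with mixed-precision Adam. It is only an estimate under stated assumptions: the 10e9 parameter count comes from "a 10B model" above, the 2/2/12 bytes-per-parameter accounting follows the ZeRO paper, and activations, buffers, and fragmentation are ignored.

```python
# Back-of-envelope per-GPU memory for ZeRO stage 2 with mixed-precision Adam.
# Assumptions: fp16 weights are replicated on every rank, while fp16 gradients
# and the fp32 optimizer states (master weights, momentum, variance =
# 12 bytes/param) are partitioned across ranks, as in the ZeRO paper.
GiB = 1024 ** 3

def zero2_bytes_per_gpu(num_params: float, num_gpus: int) -> float:
    fp16_weights = 2 * num_params                # replicated on every rank
    fp16_grads = 2 * num_params / num_gpus       # partitioned under ZeRO-2
    fp32_optimizer = 12 * num_params / num_gpus  # partitioned under ZeRO-1/2
    return fp16_weights + fp16_grads + fp32_optimizer

for num_gpus in (4, 8):
    est = zero2_bytes_per_gpu(10e9, num_gpus) / GiB
    print(f"{num_gpus} GPUs: ~{est:.0f} GiB of model/optimizer state per GPU")
```

Under these assumptions, 4 GPUs sit around 51 GiB of states per GPU before activations, which is in the same ballpark as the "59.95 GiB already allocated" in the traceback, and the failing ~12 GiB allocation is on the order of a single fp32 Adam state tensor for one rank's partition. At 8 GPUs the estimate drops to roughly 35 GiB per GPU, which would be consistent with the GPU-side OOM disappearing and the failure moving elsewhere (the return code -9 discussed below).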

sxthunder avatar Feb 18 '23 03:02 sxthunder

Same error. I am using 4 × 8 A100 (80 GB) GPUs to train a 100B GPT-2 model, with ZeRO-3 enabled in the training script. I first ran into #2185; after manually commenting out that warning, execution fails with the error report below.

10.0.1.50: [2023-02-18 07:41:18,851] [INFO] [checkpointing.py:547:forward] Activation Checkpointing Information
10.0.1.50: [2023-02-18 07:41:18,851] [INFO] [checkpointing.py:548:forward] ----Partition Activations True, CPU CHECKPOINTING True
10.0.1.50: [2023-02-18 07:41:18,851] [INFO] [checkpointing.py:551:forward] ----contiguous Memory Checkpointing True with 250 total layers
10.0.1.50: [2023-02-18 07:41:18,851] [INFO] [checkpointing.py:554:forward] ----Synchronization True
10.0.1.50: [2023-02-18 07:41:18,851] [INFO] [checkpointing.py:555:forward] ----Profiling time in checkpointing False
10.0.1.51: [2023-02-18 07:41:42,084] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 54337
10.0.1.51: [2023-02-18 07:41:42,086] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 54338
10.0.1.51: [2023-02-18 07:41:42,086] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 54339
10.0.1.51: [2023-02-18 07:41:42,087] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 54340
10.0.1.51: [2023-02-18 07:41:42,087] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 54341
10.0.1.51: [2023-02-18 07:41:42,087] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 54342
10.0.1.51: [2023-02-18 07:41:42,087] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 54343
10.0.1.51: [2023-02-18 07:41:42,087] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 54344
10.0.1.51: [2023-02-18 07:41:42,087] [ERROR] [launch.py:184:sigkill_handler] ['/opt/conda/bin/python3.8', '-u', 'pretrain_gpt2.py', '--local_rank=7', '--model-parallel-size', '1', '--num-layers', '125', '--hidden-size', '8192', '--num-attention-heads', '32', '--seq-length', '2048', '--max-position-embeddings', '2048', '--batch-size', '8', '--train-iters', '15', '--lr-decay-iters', '15', '--save', '/workdir/share/ayx_share/checkpoints/gpt2_10b_ds', '--load', '/workdir/share/ayx_share/checkpoints/gpt2_10b_ds', '--data-path', '/workdir/share/sota_analysis/books1_text_document', '--vocab-file', '/workdir/share/sota_analysis/gpt2-vocab.json', '--merge-file', '/workdir/share/sota_analysis/gpt2-merges.txt', '--data-impl', 'mmap', '--split', '949,50,1', '--distributed-backend', 'nccl', '--lr', '1.5e-4', '--lr-decay-style', 'cosine', '--min-lr', '1.0e-5', '--weight-decay', '1e-2', '--clip-grad', '1.0', '--warmup', '0.01', '--checkpoint-activations', '--log-interval', '1', '--save-interval', '15', '--eval-interval', '15', '--eval-iters', '1', '--fp16', '--scattered-embeddings', '--split-transformers', '--deepspeed', '--deepspeed_config', '/workdir/share/ayx_share/ds_zero_stage_3_config.json', '--zero-stage', '3', '--zero-reduce-bucket-size', '50000000', '--zero-allgather-bucket-size', '5000000000', '--zero-contigious-gradients', '--zero-reduce-scatter', '--deepspeed-activation-checkpointing', '--checkpoint-num-layers', '1', '--partition-activations', '--checkpoint-in-cpu', '--synchronize-each-layer', '--contigious-checkpointing'] exits with return code = -9
pdsh@69211c0bd383: 10.0.1.51: ssh exited with exit code 247


ds_report:

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [YES] ...... [OKAY]
cpu_adagrad ............ [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [YES] ...... [OKAY]
sparse_attn ............ [YES] ...... [OKAY]
transformer ............ [YES] ...... [OKAY]
stochastic_transformer . [YES] ...... [OKAY]
async_io ............... [YES] ...... [OKAY]
utils .................. [YES] ...... [OKAY]
quantizer .............. [YES] ...... [OKAY]
transformer_inference .. [YES] ...... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/opt/conda/lib/python3.8/site-packages/torch']
torch version .................... 1.13.0.dev20220719+cu113
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.3
deepspeed install path ........... ['/opt/conda/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.6.5, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.3

zincnode avatar Feb 18 '23 07:02 zincnode

I think return code -9 probably indicates CPU OOM, as mentioned in #2788. With the same environment and the same configuration, only changing the model size (100B -> 50B), there is no error.
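
Return code -9 means the ranks were killed with SIGKILL, which on Linux is typically the kernel OOM killer (its activity is visible in dmesg). One way to check the CPU-OOM hypothesis is to watch host memory on each node while DeepSpeed builds the optimizer. A minimal sketch, assuming psutil is installed; the one-second interval and starting the monitor before deepspeed.initialize are just illustrative choices:

```python
import threading
import time

import psutil  # assumed available: pip install psutil


def log_host_memory(interval_s: float = 1.0) -> None:
    """Periodically print available host RAM and this process's RSS."""
    proc = psutil.Process()
    while True:
        vm = psutil.virtual_memory()
        rss_gib = proc.memory_info().rss / 1024 ** 3
        print(f"host available: {vm.available / 1024 ** 3:.1f} GiB "
              f"({vm.percent:.0f}% used), rank RSS: {rss_gib:.1f} GiB",
              flush=True)
        time.sleep(interval_s)


# Start as a daemon thread before calling deepspeed.initialize(...)
threading.Thread(target=log_host_memory, daemon=True).start()
```

If available memory collapses toward zero right before the subprocesses are killed, host RAM rather than GPU memory is the limit, which would also explain why the smaller 50B model runs without error.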

zincnode avatar Feb 18 '23 09:02 zincnode

@sxthunder, thanks for reporting this issue. Let's start with the 4 GPU case. Can you please share your ds_config and log prior to failure? Thanks!
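
For reference while debugging, one direction sometimes tried for GPU OOM during ZeRO stage 2 optimizer creation is offloading optimizer states to CPU. Below is only a sketch of such a ds_config, not the configuration used in this issue: the bucket sizes mirror the 50000000 values in the log above, the batch size and file name are placeholders, and note that optimizer offload trades GPU memory for host RAM (so it can make a CPU-side OOM worse) and generally requires DeepSpeed's Adam rather than apex FusedAdam.

```python
import json

# Sketch of a ZeRO stage 2 config with optimizer-state CPU offload.
# All values are illustrative placeholders, not the reporter's settings.
ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "reduce_bucket_size": 50000000,
        "allgather_bucket_size": 50000000,
        "contiguous_gradients": True,
        "overlap_comm": True,
    },
}

# Hypothetical file name; pass it to the launcher via --deepspeed_config.
with open("ds_zero2_offload_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```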

tjruwase avatar Feb 21 '23 13:02 tjruwase

I solved this by setting tensor parallelism to 2.

sxthunder avatar Feb 26 '23 02:02 sxthunder

@sxthunder Hello, I am running into the same problem. Could you please share some ideas on how you solved it? I am new to DeepSpeed and cannot find how to set tensor parallelism in the official documentation. Thanks.

lecchon avatar Jul 13 '23 02:07 lecchon