
[BUG] RuntimeError: Step 1 exited with non-zero status 1

Open ucas010 opened this issue 1 year ago • 13 comments

Describe the bug: A clear and concise description of what the bug is.

To Reproduce: Steps to reproduce the behavior, following the official doc:

python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --num-gpus 1

bug

---=== Running Step 1 ===---
Traceback (most recent call last):
  File "/DeepSpeed/DeepSpeedExamples/applications/DeepSpeed-Chat/train.py", line 218, in <module>
    main(args)
  File "/DeepSpeed/DeepSpeedExamples/applications/DeepSpeed-Chat/train.py", line 203, in main
    launch_cmd(cmd, step_num)
  File "/DeepSpeed/DeepSpeedExamples/applications/DeepSpeed-Chat/train.py", line 192, in launch_cmd
    raise RuntimeError(
RuntimeError: Step 1 exited with non-zero status 1

ucas010 avatar Apr 13 '23 05:04 ucas010

Please see step1 output log in file: output/actor-models/1.3b/training.log
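For example, to see the actual failure from step 1 (log path taken from the comment above):

# print the last part of the step-1 log, where the real error usually is
tail -n 50 output/actor-models/1.3b/training.log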

Grypse avatar Apr 13 '23 06:04 Grypse

Maybe your GPU is out of memory.
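A quick way to check whether GPU memory is the problem, assuming an NVIDIA driver with nvidia-smi available, is to watch usage in a second terminal while step 1 runs:

# refresh GPU memory usage once per second
watch -n 1 nvidia-smi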

hikerell avatar Apr 13 '23 07:04 hikerell

Hi @Grypse, below is the relevant part of the log:

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
    - Avoid using tokenizers before the fork if possible
    - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
//site-packages/torch/utils/cpp_extension.py:325: UserWarning:

                           !! WARNING !!

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Your compiler (c++) is not compatible with the compiler Pytorch was built with for this platform, which is g++ on linux. Please use g++ to compile your extension. Alternatively, you may compile PyTorch from source using c++, and then you can also use c++ to compile your extension.

See https://github.com/pytorch/pytorch/blob/master/CONTRIBUTING.md for help with compiling PyTorch from source.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

                          !! WARNING !!

warnings.warn(WRONG_COMPILER_WARNING.format(
Detected CUDA files, patching ldflags
Emitting ninja build file /data/.cache/torch_extensions/py39_cu116/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
    - Avoid using tokenizers before the fork if possible
    - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
[1/2] c++ -MMD -MF fused_adam_frontend.o.d -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" /include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -c/lib/python3.9/site-packages/deepspeed/ops/csrc/adam/fused_adam_frontend.cpp -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -c /deepspeed/lib/python3.9/site-packages/deepspeed/ops/csrc/adam/fused_adam_frontend.cpp -o fused_adam_frontend.o
c++: error: unrecognized command line option ‘-std=c++14’
c++: error: unrecognized command line option ‘-std=c++14’
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "//deepspeed/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1808, in _run_ninja_build
    subprocess.run(
  File "//deepspeed/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "//DeepSpeed/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/main.py", line 328, in <module>
    main()
  File "//DeepSpeed/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/main.py", line 260, in main
    optimizer = AdamOptimizer(optimizer_grouped_parameters,
  File "//deepspeed/lib/python3.9/site-packages/deepspeed/ops/adam/fused_adam.py", line 71, in __init__
    fused_adam_cuda = FusedAdamBuilder().load()
  File "/deepspeed/lib/python3.9/site-packages/deepspeed/ops/op_builder/builder.py", line 449, in load
    return self.jit_load(verbose)
  File "//deepspeed/lib/python3.9/site-packages/deepspeed/ops/op_builder/builder.py", line 480, in jit_load
    op_module = load(name=self.name,
  File "/deepspeed/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1202, in load
    return _jit_compile(
  File "//deepspeed/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1425, in _jit_compile
    _write_ninja_file_and_build_library(
  File "//deepspeed/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1537, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/deepspeed/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1824, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'fused_adam'
[2023-04-13 14:51:47,667] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 13775
[2023-04-13 14:51:47,668] [ERROR] [launch.py:434:sigkill_handler] ['/data/deepspeed/bin/python', '-u', 'main.py', '--local_rank=0', '--model_name_or_path', 'facebook/opt-1.3b', '--gradient_accumulation_steps', '2', '--lora_dim', '128', '--zero_stage', '0', '--deepspeed', '--output_dir', '/DeepSpeed/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b'] exits with return code = 1
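For readers hitting the same failure: the key line is "c++: error: unrecognized command line option '-std=c++14'", meaning the c++ binary that ninja picked up is too old or is not g++. A minimal check, assuming a Linux shell where both c++ and g++ are on the PATH (the CXX override at the end is a sketch; confirm that your torch.utils.cpp_extension version honors the CXX environment variable):

# print the versions of the two compilers the warning above is talking about
c++ --version
g++ --version
# verify whether each one accepts the flag that failed in the build log
echo 'int main(){return 0;}' > /tmp/std_check.cpp
c++ -std=c++14 /tmp/std_check.cpp -o /tmp/std_check && echo "c++ accepts -std=c++14"
g++ -std=c++14 /tmp/std_check.cpp -o /tmp/std_check && echo "g++ accepts -std=c++14"
# if only g++ works, pointing the extension build at it may help
export CXX=g++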

ucas010 avatar Apr 13 '23 09:04 ucas010

Then I used another Linux server and got:

RuntimeError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 14.56 GiB total capacity; 13.30 GiB already allocated; 230.50 MiB free; 13.65 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Bad news. So how much memory does the little 1.3b model need?
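As the OOM message itself suggests, the allocator hint is worth a try before anything else; a minimal sketch, assuming the same single-GPU launch as above (the 128 MiB split size is an arbitrary starting point, and this only helps with fragmentation, not with a genuine capacity shortfall):

# allocator hint taken from the error text above
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --num-gpus 1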

ucas010 avatar Apr 13 '23 11:04 ucas010

@hikerell

ucas010 avatar Apr 13 '23 11:04 ucas010

@ucas010, yes, a 1.3B model with the Adam optimizer needs about 1.3 * 14GB ≈ 18GB of GPU memory. Your error message suggests that your GPU has ~14GB. Can you try multiple GPUs so that ZeRO memory optimizations can help?

tjruwase avatar Apr 13 '23 12:04 tjruwase

@ucas010, for memory analysis see Figure 1 in https://arxiv.org/pdf/1910.02054.pdf
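For a quick back-of-envelope check of the figure quoted above, using the ~14 bytes per parameter implied by the previous comment (the exact per-parameter cost depends on optimizer, precision, and ZeRO stage; see the paper for the full breakdown):

# 1.3e9 params * ~14 bytes/param of model states ≈ 18 GB
python -c "print(f'{1.3e9 * 14 / 1e9:.1f} GB')"   # prints 18.2 GB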

tjruwase avatar Apr 13 '23 12:04 tjruwase

@ucas010, yes, a 1.3B model with the Adam optimizer needs about 1.3 * 14GB ≈ 18GB of GPU memory. Your error message suggests that your GPU has ~14GB. Can you try multiple GPUs so that ZeRO memory optimizations can help?

With the default opt-1.3b script config and 1x A100 40G, there is a CUDA OOM error, so I modified the opt-1.3b script to limit the batch size.

deepspeed --num_gpus 1 main.py --model_name_or_path facebook/opt-1.3b \
   --gradient_accumulation_steps 2 --lora_dim 128 --zero_stage $ZERO_STAGE \
   --per_device_train_batch_size 8 \
   --per_device_eval_batch_size 8 \
   --deepspeed --output_dir $OUTPUT &> $OUTPUT/training.log

Now everything works, but it needs ~30GB of GPU memory.

[2023-04-13 12:15:31,180] [INFO] [timer.py:199:stop] epoch=0/micro_step=3120/global_step=1560, RunningAvgSamplesPerSec=23.920492322905034, CurrSamplesPerSec=23.957474851543964, MemAllocated=7.77GB, MaxMemAllocated=26.0GB
[2023-04-13 12:15:37,888] [INFO] [logging.py:96:log_dist] [Rank 0] step=1570, skipped=15, lr=[0.0004315618360406618, 0.0004315618360406618], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-13 12:15:37,891] [INFO] [timer.py:199:stop] epoch=0/micro_step=3140/global_step=1570, RunningAvgSamplesPerSec=23.920438347021395, CurrSamplesPerSec=23.945105514032797, MemAllocated=7.77GB, MaxMemAllocated=26.0GB
[2023-04-13 12:15:44,589] [INFO] [logging.py:96:log_dist] [Rank 0] step=1580, skipped=15, lr=[0.00042612547216574543, 0.00042612547216574543], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-13 12:15:44,592] [INFO] [timer.py:199:stop] epoch=0/micro_step=3160/global_step=1580, RunningAvgSamplesPerSec=23.920614039044654, CurrSamplesPerSec=23.953849064281354, MemAllocated=7.77GB, MaxMemAllocated=26.0GB

This confuses me.

hikerell avatar Apr 13 '23 12:04 hikerell

Glad to know things are working.

  1. Why do you conclude that 30GB is used? Your log suggests ~26GB max memory usage.

  2. The extra memory usage is likely due to activation memory. You can enable HF gradient checkpointing to help (a sketch follows below).
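A minimal sketch of what that could look like with the step-1 launch command shown earlier, assuming the script accepts the --gradient_checkpointing flag mentioned by the maintainers further down in this thread:

# same single-GPU launch as above, with HF gradient checkpointing turned on
deepspeed --num_gpus 1 main.py --model_name_or_path facebook/opt-1.3b \
   --gradient_accumulation_steps 2 --lora_dim 128 --zero_stage $ZERO_STAGE \
   --per_device_train_batch_size 8 \
   --per_device_eval_batch_size 8 \
   --gradient_checkpointing \
   --deepspeed --output_dir $OUTPUT &> $OUTPUT/training.log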

tjruwase avatar Apr 13 '23 12:04 tjruwase

Glad to know things are working.

  1. Why do you conclude that 30GB is used? Your log suggests ~26GB max memory usage.
  2. The extra memory usage is likely due to activation memory. You can enable HF gradient checkpointing to help.

Probably a dumb question. I was able to run successfully on a single GPU after adding per_device_train_batch_size and per_device_eval_batch_size following the suggestion above. However, when I try to train with multiple GPUs (e.g. 4x V100 32GB), it always runs out of memory. Do you have any idea about this?

alibabadoufu avatar Apr 14 '23 16:04 alibabadoufu

@alibabadoufu In the multi-GPU situation, you can try increasing --zero_stage to improve memory usage. Also, enabling --gradient_checkpointing or --only_optimizer_lora will reduce memory usage.
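For the 4x V100 32GB case, a sketch of those suggestions applied to the step-1 command from earlier in the thread (the flag spellings and the batch size of 4 are assumptions; check them against your copy of main.py):

# higher ZeRO stage plus checkpointing and LoRA-only optimizer states to cut per-GPU memory
deepspeed --num_gpus 4 main.py --model_name_or_path facebook/opt-1.3b \
   --gradient_accumulation_steps 2 --lora_dim 128 --zero_stage 3 \
   --per_device_train_batch_size 4 \
   --per_device_eval_batch_size 4 \
   --gradient_checkpointing --only_optimizer_lora \
   --deepspeed --output_dir $OUTPUT &> $OUTPUT/training.log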

mrwyattii avatar Apr 14 '23 17:04 mrwyattii

Can it run on a MacBook Pro with an M1 Pro CPU?

GBChain avatar Apr 15 '23 08:04 GBChain

Can it run on a MacBook Pro with an M1 Pro CPU?

Sorry, but we don't currently support M1. I know torch now supports it but the biggest blocker here relates to our custom CUDA kernels that accelerate inference via our HybridEngine.

jeffra avatar Apr 18 '23 17:04 jeffra

I have got this same problem: RuntimeError: Error building extension 'fused_adam'

Solved by updating the gcc version and activating it. It helps!
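For reference, one way that compiler update can look on an Ubuntu-style system (the package names and versions, and whether your environment needs conda or devtoolset instead, are assumptions; the cache path comes from the build log earlier in this thread):

# install a newer g++ and make it visible to the JIT extension build
sudo apt-get install -y gcc-9 g++-9
export CC=gcc-9 CXX=g++-9
# drop the stale fused_adam build so it gets recompiled with the new compiler
rm -rf /data/.cache/torch_extensions/py39_cu116/fused_adam
# then re-run step 1 (python train.py ... as in the original report)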

enddlesswm avatar Oct 10 '23 10:10 enddlesswm