DeepSpeed
[BUG] RuntimeError: Step 1 exited with non-zero status 1
Describe the bug: Step 1 of the DeepSpeed-Chat training pipeline exits with non-zero status 1 when run following the official doc.
To Reproduce: follow the official doc and run
python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --num-gpus 1
Error output:
---=== Running Step 1 ===---
Traceback (most recent call last):
File "/DeepSpeed/DeepSpeedExamples/applications/DeepSpeed-Chat/train.py", line 218, in <module>
main(args)
File "/DeepSpeed/DeepSpeedExamples/applications/DeepSpeed-Chat/train.py", line 203, in main
launch_cmd(cmd, step_num)
File "/DeepSpeed/DeepSpeedExamples/applications/DeepSpeed-Chat/train.py", line 192, in launch_cmd
raise RuntimeError(
RuntimeError: Step 1 exited with non-zero status 1
Please see step1 output log in file: output/actor-models/1.3b/training.log
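To see what actually went wrong, check the step-1 log that the message points to, for example:
tail -n 50 output/actor-models/1.3b/training.log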
Maybe your GPU is out of memory.
Hi @Grypse, here is the log below:
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using tokenizers
before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
//site-packages/torch/utils/cpp_extension.py:325: UserWarning:
!! WARNING !!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Your compiler (c++) is not compatible with the compiler Pytorch was built with for this platform, which is g++ on linux. Please use g++ to compile your extension. Alternatively, you may compile PyTorch from source using c++, and then you can also use c++ to compile your extension.
See https://github.com/pytorch/pytorch/blob/master/CONTRIBUTING.md for help with compiling PyTorch from source.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!! WARNING !!
warnings.warn(WRONG_COMPILER_WARNING.format(
Detected CUDA files, patching ldflags
Emitting ninja build file /data/.cache/torch_extensions/py39_cu116/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] c++ -MMD -MF fused_adam_frontend.o.d -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" /include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -c/lib/python3.9/site-packages/deepspeed/ops/csrc/adam/fused_adam_frontend.cpp -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -c /deepspeed/lib/python3.9/site-packages/deepspeed/ops/csrc/adam/fused_adam_frontend.cpp -o fused_adam_frontend.o
c++: error: unrecognized command line option ‘-std=c++14’
c++: error: unrecognized command line option ‘-std=c++14’
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "//deepspeed/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1808, in _run_ninja_build
subprocess.run(
File "//deepspeed/lib/python3.9/subprocess.py", line 528, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "//DeepSpeed/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/main.py", line 328, in
Then I tried another Linux server and got:
RuntimeError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 14.56 GiB total capacity; 13.30 GiB already allocated; 230.50 MiB free; 13.65 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Bad news. So how much memory does the little 1.3B model actually need?
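As a side note, the OOM message above also suggests trying PYTORCH_CUDA_ALLOC_CONF. A minimal sketch (the 128 MiB value is only an example; this mitigates fragmentation but will not help if the model simply does not fit):
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --num-gpus 1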
@hikerell
@ucas010, yes, a 1.3B model with the Adam optimizer needs about 1.3 × 14 GB ≈ 18 GB of GPU memory. Your error message suggests that your GPU has ~14 GB. Can you try multiple GPUs so that the ZeRO memory optimizations can help?
@ucas010, for memory analysis see Figure 1 in https://arxiv.org/pdf/1910.02054.pdf
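As a back-of-the-envelope check (assuming the ZeRO paper's accounting of roughly 16 bytes per parameter for mixed-precision Adam: 2 B fp16 weights + 2 B fp16 gradients + 12 B fp32 optimizer states):
1.3e9 params × ~14-16 bytes/param ≈ 18-21 GB of model states, before counting activations and allocator overhead
Either way, that is well above the ~14 GB total capacity the error message reports.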
With the default opt-1.3b script config on a single A100 40G, I hit a CUDA OOM error, so I modified the opt-1.3b script to limit the batch size:
deepspeed --num_gpus 1 main.py --model_name_or_path facebook/opt-1.3b \
--gradient_accumulation_steps 2 --lora_dim 128 --zero_stage $ZERO_STAGE \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 8 \
--deepspeed --output_dir $OUTPUT &> $OUTPUT/training.log
Now everything works, but it needs ~30GB of GPU memory.
[2023-04-13 12:15:31,180] [INFO] [timer.py:199:stop] epoch=0/micro_step=3120/global_step=1560, RunningAvgSamplesPerSec=23.920492322905034, CurrSamplesPerSec=23.957474851543964, MemAllocated=7.77GB, MaxMemAllocated=26.0GB
[2023-04-13 12:15:37,888] [INFO] [logging.py:96:log_dist] [Rank 0] step=1570, skipped=15, lr=[0.0004315618360406618, 0.0004315618360406618], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-13 12:15:37,891] [INFO] [timer.py:199:stop] epoch=0/micro_step=3140/global_step=1570, RunningAvgSamplesPerSec=23.920438347021395, CurrSamplesPerSec=23.945105514032797, MemAllocated=7.77GB, MaxMemAllocated=26.0GB
[2023-04-13 12:15:44,589] [INFO] [logging.py:96:log_dist] [Rank 0] step=1580, skipped=15, lr=[0.00042612547216574543, 0.00042612547216574543], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-13 12:15:44,592] [INFO] [timer.py:199:stop] epoch=0/micro_step=3160/global_step=1580, RunningAvgSamplesPerSec=23.920614039044654, CurrSamplesPerSec=23.953849064281354, MemAllocated=7.77GB, MaxMemAllocated=26.0GB
This confuses me.
Glad to know things are working.
- Why do you conclude that 30GB is used? Your log suggests ~26GB max memory usage.
- The extra memory usage is likely due to activation memory. You can enable HF gradient checkpointing to help.
Probably a dumb question. I was able to run successfully on a single GPU after adding per_device_train_batch_size and per_device_eval_batch_size following the suggestion above. However, when I try to train with multiple GPUs (e.g. 4x V100 32GB), it always runs out of memory. Do you have any idea about this?
@alibabadoufu in the multi-GPU situation, you can try increasing --zero_stage to improve memory usage. Also, enabling --gradient_checkpointing or --only_optimizer_lora will reduce memory usage.
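A sketch of what that could look like for the step-1 script, using only the flag names mentioned above and in the opt-1.3b example (the ZeRO stage and batch sizes are placeholders to tune for your GPUs):
deepspeed --num_gpus 4 main.py --model_name_or_path facebook/opt-1.3b \
--per_device_train_batch_size 4 --per_device_eval_batch_size 4 \
--gradient_accumulation_steps 2 --lora_dim 128 --zero_stage 3 \
--gradient_checkpointing \
--deepspeed --output_dir $OUTPUT &> $OUTPUT/training.log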
Can it run on a MacBook Pro with an M1 Pro CPU?
Sorry, but we don't currently support M1. I know torch now supports it but the biggest blocker here relates to our custom CUDA kernels that accelerate inference via our HybridEngine.
I have got this same problem: RuntimeError: Error building extension 'fused_adam'
Solved by updating the gcc version and activating it. That helped!
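For anyone hitting the same ‘-std=c++14’ error: it usually means the c++ found on PATH is too old or is not g++ at all, which is also what the PyTorch compiler warning above is hinting at. A sketch of that kind of fix, assuming a newer g++ is already installed at /usr/bin/g++ (adjust the paths for your system; the extension cache location may also differ, e.g. the log above used /data/.cache/torch_extensions):
export CC=/usr/bin/gcc
export CXX=/usr/bin/g++
rm -rf ~/.cache/torch_extensions  # clear the stale fused_adam build before retrying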