stanford_alpaca

OOM error while training llama-7b with five V100-32G GPUs

Open chenzuozhou opened this issue 1 year ago • 3 comments

I am using five V100-32G GPUs to fine-tune llama-7b and get an OOM error every time.

Here is the error message:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 388.00 MiB (GPU 3; 31.75 GiB total capacity; 28.42 GiB already allocated; 340.94 MiB free; 30.39 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
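
For reference, the max_split_size_mb knob mentioned in the message is an allocator option set through an environment variable before launching, e.g.:

    # Allocator hint suggested by the error message; 128 MiB is just an
    # arbitrary starting value, not a tuned one.
    export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128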

Here is the run command:

CUDA_VISIBLE_DEVICES=0,1,2,3,4 torchrun --nproc_per_node=5 --master_port=23456 train.py \
    --model_name_or_path /data/alpaca/stanford_alpaca/llama_hf \
    --data_path ./alpaca_data.json \
    --fp16 True \
    --bf16 False \
    --output_dir /data/alpaca/stanford_alpaca/llama_tf \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 False

chenzuozhou avatar Apr 17 '23 12:04 chenzuozhou

maybe try full_shard offload auto_wrap
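
That is, changing only the --fsdp flag in your command, something along these lines (some flags trimmed for brevity):

CUDA_VISIBLE_DEVICES=0,1,2,3,4 torchrun --nproc_per_node=5 --master_port=23456 train.py \
    --model_name_or_path /data/alpaca/stanford_alpaca/llama_hf \
    --data_path ./alpaca_data.json \
    --fp16 True \
    --output_dir /data/alpaca/stanford_alpaca/llama_tf \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --fsdp "full_shard offload auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer'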

kir152 avatar Apr 17 '23 12:04 kir152

maybe try full_shard offload auto_wrap

Thanks for your answer. Using DeepSpeed ZeRO stage-3 (with offload) solved the OOM issue.
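
Roughly, a ZeRO-3 offload config looks like this (a minimal sketch for illustration, not the exact file; the repo's configs/default_offload_opt_param.json is the full version):

{
  "fp16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}

With the Hugging Face Trainer integration, the "auto" values are filled in from the training arguments.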

chenzuozhou avatar Apr 17 '23 23:04 chenzuozhou

@chenzuozhou I am trying to run the fine-tuning code with DeepSpeed in a similar setting to yours - I have access to eight 32GB V100 GPUs. I am running the same command as given in the README with a few parameter modifications:

torchrun --nproc_per_node=4 --master_port=3030 train.py \
    --model_name_or_path <path> \
    --data_path ./alpaca_data.json \
    --fp16 True \
    --output_dir output \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --deepspeed "./configs/default_opt_param.json"

And I also changed bf16 to fp16 in the deepspeed config file default_opt_param.json.
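
In other words, the precision entries in that file end up looking roughly like this (just a sketch of those two keys, not the whole file):

{
  "bf16": { "enabled": false },
  "fp16": { "enabled": "auto" }
}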

I am running into a SIGNAL 7 (SIGBUS) error. Please see trace below:

WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your
 application as needed.
*****************************************
[2023-04-19 16:03:16,896] [INFO] [comm.py:586:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 44539 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 0 (pid: 44536) of binary: /root/chat-llm/stanford_alpaca/venv/bin/python3.10
Traceback (most recent call last):
  File "/root/chat-llm/stanford_alpaca/venv/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/chat-llm/stanford_alpaca/venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/root/chat-llm/stanford_alpaca/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/root/chat-llm/stanford_alpaca/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/root/chat-llm/stanford_alpaca/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/chat-llm/stanford_alpaca/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=====================================================
train.py FAILED
-----------------------------------------------------
Failures:
[1]:
  time      : 2023-04-19_16:03:40
  host      : usethi-fullnode-alpaca-finetune-fml5b
  rank      : 1 (local_rank: 1)
  exitcode  : -7 (pid: 44537)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 44537
[2]:
  time      : 2023-04-19_16:03:40
  host      : usethi-fullnode-alpaca-finetune-fml5b
  rank      : 2 (local_rank: 2)
  exitcode  : -7 (pid: 44538)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 44538
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-04-19_16:03:40
  host      : usethi-fullnode-alpaca-finetune-fml5b
  rank      : 0 (local_rank: 0)
  exitcode  : -7 (pid: 44536)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 44536
=====================================================

Did you encounter this error? If not, could you please share some details about your environment so that I could compare them with mine?

Here are some details about my environment:

  1. nvcc version:
$nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_May__3_19:15:13_PDT_2021
Cuda compilation tools, release 11.3, V11.3.109
Build cuda_11.3.r11.3/compiler.29920130_0
  2. nccl version:
$python -c "import torch;print(torch.cuda.nccl.version())"
(2, 14, 3)
  3. pip freeze output:
$pip freeze
absl-py==1.4.0
accelerate==0.18.0
aiohttp==3.8.4
aiosignal==1.3.1
appdirs==1.4.4
async-timeout==4.0.2
attrs==23.1.0
certifi==2022.12.7
charset-normalizer==3.1.0
click==8.1.3
cmake==3.26.3
deepspeed==0.9.0
docker-pycreds==0.4.0
filelock==3.11.0
fire==0.5.0
frozenlist==1.3.3
gitdb==4.0.10
GitPython==3.1.31
hjson==3.1.0
huggingface-hub==0.13.4
idna==3.4
Jinja2==3.1.2
joblib==1.2.0
lit==16.0.1
MarkupSafe==2.1.2
mpmath==1.3.0
multidict==6.0.4
networkx==3.1
ninja==1.11.1
nltk==3.8.1
numpy==1.24.2
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-cupti-cu11==11.7.101
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
nvidia-cufft-cu11==10.9.0.58
nvidia-curand-cu11==10.2.10.91
nvidia-cusolver-cu11==11.4.0.1
nvidia-cusparse-cu11==11.7.4.91
nvidia-nccl-cu11==2.14.3
nvidia-nvtx-cu11==11.7.91
openai==0.27.4
packaging==23.1
pathtools==0.1.2
protobuf==4.22.3
psutil==5.9.4
py-cpuinfo==9.0.0
pydantic==1.10.7
PyYAML==6.0
regex==2023.3.23
requests==2.28.2
rouge-score==0.1.2
sentencepiece==0.1.98
sentry-sdk==1.19.1
setproctitle==1.3.2
six==1.16.0
smmap==5.0.0
sympy==1.11.1
termcolor==2.2.0
tokenizers==0.13.3
torch==2.0.0
tqdm==4.65.0
transformers @ file:///root/chat-llm/stanford_alpaca/temp/transformers
triton==2.0.0
typing_extensions==4.5.0
urllib3==1.26.15
wandb==0.14.2
yarl==1.8.2

If there are any other details beyond the above that you think might be helpful, please share those too. I would greatly appreciate any help or direction!

udhavsethi avatar Apr 19 '23 23:04 udhavsethi

I didn't encounter the "SIGNAL 7 (SIGBUS)" error. Here is my run command, which works. I think your error is caused by NCCL, so you could try specifying the GPU devices explicitly:

CUDA_VISIBLE_DEVICES=0,1,2,3,4 torchrun --nproc_per_node=5 --master_port=23456 train.py \
    --model_name_or_path \
    --data_path ./alpaca_data.json \
    --fp16 True \
    --output_dir \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --deepspeed "./configs/default_offload_opt_param.json" \
    --tf32 False

chenzuozhou avatar Apr 21 '23 00:04 chenzuozhou

(udhavsethi's comment above, quoted in full)

If you have 8 GPUs, then you want to set nproc_per_node=8, not 4.
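
e.g., something like this, keeping the rest of your flags unchanged:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --nproc_per_node=8 --master_port=3030 train.py \
    --model_name_or_path <path> \
    --data_path ./alpaca_data.json \
    --fp16 True \
    --output_dir output \
    --deepspeed "./configs/default_opt_param.json"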

tonyzhao6 avatar Apr 22 '23 00:04 tonyzhao6

(udhavsethi's comment above, quoted in full)

Are you running in k8s? This error may be caused by the k8s environment.
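
One thing worth checking inside the pod, though this is only a guess, is the shared-memory size, since NCCL and the PyTorch DataLoader both use /dev/shm:

    # Kubernetes/Docker give containers a 64 MiB /dev/shm by default, which can
    # trigger SIGBUS when NCCL or DataLoader workers run out of shared memory.
    df -h /dev/shm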

Carolmelon avatar Jun 07 '23 09:06 Carolmelon

@chenzuozhou, that helped a lot. I was able to start training. Were you able to replicate the results with these parameters? And how long did the training take on 8 V100s with batch size 2?

manaspalaparthi avatar Jun 11 '23 03:06 manaspalaparthi

@chenzuozhou, that helped a lot. I was able to start training. Were you able to replicate the results with these parameters? And how long did the training take on 8 V100s with batch size 2?

DeepSpeed took about 20 hours for 3 epochs on 8 V100s with batch size 3.

Carolmelon avatar Jun 11 '23 06:06 Carolmelon

Did you come across the issue where the Parameter object does not have the attribute 'comm' when using DeepSpeed ZeRO-3?

JianqiaoLu avatar Jul 21 '23 11:07 JianqiaoLu

maybe try full_shard offload auto_wrap

Will the code stop printing anything when running with offload? I find that my run just gets stuck when I use "full_shard offload auto_wrap".

JianqiaoLu avatar Jul 21 '23 11:07 JianqiaoLu

(udhavsethi's comment above, quoted in full)

I have the same settings as these, but I got a different error:

Loading extension module cpu_adam...
Time to load cpu_adam op: 3.1073150634765625 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 3.170907497406006 seconds
Parameter Offload: Total persistent parameters: 643072 in 242 params
[2023-11-12 17:14:54,258] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1941
[2023-11-12 17:14:54,321] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1942
[2023-11-12 17:14:54,321] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1943
[2023-11-12 17:14:55,380] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1944
[2023-11-12 17:14:55,434] [ERROR] [launch.py:321:sigkill_handler] ['/home/wangyidan/anaconda3/envs/LLM/bin/python', '-u', 'main.py', '--local_rank=3', '--model_name', 'llama2-7b-hf', '--model_name_or_path', '../model/llama2-7b-hf', '--fp16', 'True', '--data_path', 'data/train/origin/alpaca_gpt4_data.json', '--p_data_path', 'data/train/poison/refusal_tgoutput_ns5200_from0_seed0.jsonl', '--p_seed', '42', '--p_n_sample', '500', '--p_type', 'refusal', '--output_dir', './output/custom/opt-1-3b-refusal-output-ns500-seed42', '--num_train_epochs', '3', '--per_device_train_batch_size', '8', '--per_device_eval_batch_size', '8', '--gradient_accumulation_steps', '16', '--evaluation_strategy', 'no', '--save_strategy', 'steps', '--save_steps', '200', '--save_total_limit', '1', '--learning_rate', '2e-5', '--weight_decay', '0.', '--warmup_ratio', '0.03', '--lr_scheduler_type', 'cosine', '--logging_steps', '100', '--report_to', 'none', '--deepspeed', './default_offload_opt_param.json', '--tf32', 'False'] exits with return code = -4

redwyd avatar Nov 12 '23 10:11 redwyd