stanford_alpaca
OOM error while training llama-7b with five V100-32G GPUs
I am using five V100-32G GPUs to fine-tune llama-7b and hit an OOM error every time.
Here is the error message:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 388.00 MiB (GPU 3; 31.75 GiB total capacity; 28.42 GiB already allocated; 340.94 MiB free; 30.39 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
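The allocator hint at the end of that message is worth trying first, although on its own it is usually not enough to fit a 7B model on 32 GB cards. A minimal sketch, assuming the same launch as below; the 128 MiB split size is an illustrative value, not a recommendation from this repo:

```bash
# Cap the caching allocator's split size to reduce fragmentation before launching.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
CUDA_VISIBLE_DEVICES=0,1,2,3,4 torchrun --nproc_per_node=5 --master_port=23456 train.py ...
```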
Here is the run command:
CUDA_VISIBLE_DEVICES=0,1,2,3,4 torchrun --nproc_per_node=5 --master_port=23456 train.py \
--model_name_or_path /data/alpaca/stanford_alpaca/llama_hf \
--data_path ./alpaca_data.json \
--fp16 True \
--bf16 False \
--output_dir /data/alpaca/stanford_alpaca/llama_tf \
--num_train_epochs 3 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 8 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 1000 \
--save_total_limit 1 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--fsdp "full_shard auto_wrap" \
--fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
--tf32 False
Maybe try --fsdp "full_shard offload auto_wrap".
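For concreteness, a sketch of what that change looks like relative to the command above; only the --fsdp argument changes (offload moves sharded parameters and optimizer state to CPU RAM, trading speed for GPU memory):

```bash
# Same launch as above, only the FSDP policy gains CPU offload.
CUDA_VISIBLE_DEVICES=0,1,2,3,4 torchrun --nproc_per_node=5 --master_port=23456 train.py \
    ... \
    --fsdp "full_shard offload auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 False
```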
Thanks for your answer. Using DeepSpeed stage-3 (with offload) solved the OOM issue for me.
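For anyone reading along, the switch amounts to dropping the --fsdp arguments and passing --deepspeed with the offload config that ships with the repo; a full working command appears later in this thread:

```bash
# Sketch: replace the two --fsdp arguments with a DeepSpeed ZeRO-3 offload config.
torchrun --nproc_per_node=5 --master_port=23456 train.py \
    ... \
    --deepspeed "./configs/default_offload_opt_param.json"
```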
@chenzuozhou I am trying to run the fine-tuning code with DeepSpeed in a similar setting to yours - I have access to eight 32GB V100 GPUs. I am running the same command as given in the README with a few parameter modifications:
torchrun --nproc_per_node=4 --master_port=3030 train.py \
--model_name_or_path <path> \
--data_path ./alpaca_data.json \
--fp16 True \
--output_dir output \
--num_train_epochs 1 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 1 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 2000 \
--save_total_limit 1 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--deepspeed "./configs/default_opt_param.json"
I also changed bf16 to fp16 in the DeepSpeed config file default_opt_param.json.
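For reference, a minimal sketch of what that bf16-to-fp16 switch looks like; the actual default_opt_param.json in this repo may use "auto" values and additional keys, so treat this only as an illustration of the relevant blocks:

```bash
# Hypothetical snippet showing the two precision blocks after the change;
# merge the equivalent keys into default_opt_param.json rather than running this as-is.
cat <<'EOF' > fp16_blocks_example.json
{
  "bf16": { "enabled": false },
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "initial_scale_power": 16
  }
}
EOF
```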
I am running into a SIGNAL 7 (SIGBUS) error. Please see the trace below:
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your
application as needed.
*****************************************
[2023-04-19 16:03:16,896] [INFO] [comm.py:586:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 44539 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 0 (pid: 44536) of binary: /root/chat-llm/stanford_alpaca/venv/bin/python3.10
Traceback (most recent call last):
File "/root/chat-llm/stanford_alpaca/venv/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/root/chat-llm/stanford_alpaca/venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/root/chat-llm/stanford_alpaca/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/root/chat-llm/stanford_alpaca/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/root/chat-llm/stanford_alpaca/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/chat-llm/stanford_alpaca/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=====================================================
train.py FAILED
-----------------------------------------------------
Failures:
[1]:
time : 2023-04-19_16:03:40
host : usethi-fullnode-alpaca-finetune-fml5b
rank : 1 (local_rank: 1)
exitcode : -7 (pid: 44537)
error_file: <N/A>
traceback : Signal 7 (SIGBUS) received by PID 44537
[2]:
time : 2023-04-19_16:03:40
host : usethi-fullnode-alpaca-finetune-fml5b
rank : 2 (local_rank: 2)
exitcode : -7 (pid: 44538)
error_file: <N/A>
traceback : Signal 7 (SIGBUS) received by PID 44538
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-04-19_16:03:40
host : usethi-fullnode-alpaca-finetune-fml5b
rank : 0 (local_rank: 0)
exitcode : -7 (pid: 44536)
error_file: <N/A>
traceback : Signal 7 (SIGBUS) received by PID 44536
=====================================================
Did you encounter this error? If not, could you please share some details about your environment so that I could compare them with mine?
Here are some details about my environment:
- nvcc version
$nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_May__3_19:15:13_PDT_2021
Cuda compilation tools, release 11.3, V11.3.109
Build cuda_11.3.r11.3/compiler.29920130_0
- nccl version:
$python -c "import torch;print(torch.cuda.nccl.version())"
(2, 14, 3)
- pip freeze output:
$pip freeze
absl-py==1.4.0
accelerate==0.18.0
aiohttp==3.8.4
aiosignal==1.3.1
appdirs==1.4.4
async-timeout==4.0.2
attrs==23.1.0
certifi==2022.12.7
charset-normalizer==3.1.0
click==8.1.3
cmake==3.26.3
deepspeed==0.9.0
docker-pycreds==0.4.0
filelock==3.11.0
fire==0.5.0
frozenlist==1.3.3
gitdb==4.0.10
GitPython==3.1.31
hjson==3.1.0
huggingface-hub==0.13.4
idna==3.4
Jinja2==3.1.2
joblib==1.2.0
lit==16.0.1
MarkupSafe==2.1.2
mpmath==1.3.0
multidict==6.0.4
networkx==3.1
ninja==1.11.1
nltk==3.8.1
numpy==1.24.2
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-cupti-cu11==11.7.101
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
nvidia-cufft-cu11==10.9.0.58
nvidia-curand-cu11==10.2.10.91
nvidia-cusolver-cu11==11.4.0.1
nvidia-cusparse-cu11==11.7.4.91
nvidia-nccl-cu11==2.14.3
nvidia-nvtx-cu11==11.7.91
openai==0.27.4
packaging==23.1
pathtools==0.1.2
protobuf==4.22.3
psutil==5.9.4
py-cpuinfo==9.0.0
pydantic==1.10.7
PyYAML==6.0
regex==2023.3.23
requests==2.28.2
rouge-score==0.1.2
sentencepiece==0.1.98
sentry-sdk==1.19.1
setproctitle==1.3.2
six==1.16.0
smmap==5.0.0
sympy==1.11.1
termcolor==2.2.0
tokenizers==0.13.3
torch==2.0.0
tqdm==4.65.0
transformers @ file:///root/chat-llm/stanford_alpaca/temp/transformers
triton==2.0.0
typing_extensions==4.5.0
urllib3==1.26.15
wandb==0.14.2
yarl==1.8.2
If there are any other details beyond the above that you think might be helpful, please share those too. I would greatly appreciate any help or direction!
I didn't hit the "SIGNAL 7 (SIGBUS)" error; here is my run command, which works. I think your error is caused by NCCL, so you could try pinning specific GPU devices with CUDA_VISIBLE_DEVICES.
CUDA_VISIBLE_DEVICES=0,1,2,3,4 torchrun --nproc_per_node=5 --master_port=23456 train.py \
--model_name_or_path \
--data_path ./alpaca_data.json \
--fp16 True \
--output_dir \
--num_train_epochs 3 \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 2 \
--gradient_accumulation_steps 8 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 2000 \
--save_total_limit 1 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--deepspeed "./configs/default_offload_opt_param.json" \
--tf32 False
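If the failure really is NCCL-related, turning on NCCL's logging before launching should make that visible; these are standard NCCL environment variables, not something specific to this repo:

```bash
# Print NCCL initialization/transport details so communicator failures show up in the logs.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET
CUDA_VISIBLE_DEVICES=0,1,2,3,4 torchrun --nproc_per_node=5 --master_port=23456 train.py ...
```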
If you have 8 GPUs, then you want to set nproc_per_node to 8, not 4.
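In other words, a sketch of the corrected launch line (keep the remaining arguments exactly as in the command above):

```bash
# One worker process per visible GPU.
torchrun --nproc_per_node=8 --master_port=3030 train.py ...
```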
Are you running in k8s? This error may be caused by the k8s environment.
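One common way a container or k8s pod triggers SIGBUS is an undersized /dev/shm, which dataloader workers and NCCL rely on; this is only a guess about the cause here, but it is quick to check from inside the pod:

```bash
# A tiny value such as 64M suggests shared memory is the culprit; if so,
# mount a larger emptyDir (medium: Memory) at /dev/shm in the pod spec.
df -h /dev/shm
```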
@chenzuozhou, that helped a lot. I was able to start the training. Were you able to replicate the results with these parameters? And how long did the training take on 8 V100s with batch size 2?
DeepSpeed took about 20 hours for 3 epochs on 8x V100 with batch size 3.
Did you come across the issue where the Parameter object does not have the attribute comm when using DeepSpeed ZeRO-3?
Will the code stop printing anything when running with this offload? I find that my run just gets stuck when I use "full_shard offload auto_wrap".
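CPU offload can make startup and the first optimizer steps much slower, so a quiet console does not necessarily mean a hang. A quick way to tell whether the run is still making progress, assuming a Linux host (not specific to this repo):

```bash
# In one terminal: GPU utilization/memory ticking over means the job is alive.
watch -n 5 nvidia-smi
# In another: host memory filling up is expected when offloading to CPU.
free -g
```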
I have the same setting as this, but I got another error like this:
Loading extension module cpu_adam...
Time to load cpu_adam op: 3.1073150634765625 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 3.170907497406006 seconds
Parameter Offload: Total persistent parameters: 643072 in 242 params
[2023-11-12 17:14:54,258] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1941
[2023-11-12 17:14:54,321] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1942
[2023-11-12 17:14:54,321] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1943
[2023-11-12 17:14:55,380] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1944
[2023-11-12 17:14:55,434] [ERROR] [launch.py:321:sigkill_handler] ['/home/wangyidan/anaconda3/envs/LLM/bin/python', '-u', 'main.py', '--local_rank=3', '--model_name', 'llama2-7b-hf', '--model_name_or_path', '../model/llama2-7b-hf', '--fp16', 'True', '--data_path', 'data/train/origin/alpaca_gpt4_data.json', '--p_data_path', 'data/train/poison/refusal_tgoutput_ns5200_from0_seed0.jsonl', '--p_seed', '42', '--p_n_sample', '500', '--p_type', 'refusal', '--output_dir', './output/custom/opt-1-3b-refusal-output-ns500-seed42', '--num_train_epochs', '3', '--per_device_train_batch_size', '8', '--per_device_eval_batch_size', '8', '--gradient_accumulation_steps', '16', '--evaluation_strategy', 'no', '--save_strategy', 'steps', '--save_steps', '200', '--save_total_limit', '1', '--learning_rate', '2e-5', '--weight_decay', '0.', '--warmup_ratio', '0.03', '--lr_scheduler_type', 'cosine', '--logging_steps', '100', '--report_to', 'none', '--deepspeed', './default_offload_opt_param.json', '--tf32', 'False'] exits with return code = -4