alpaca-lora
SIGBUS Error During Training with Multiple GPUs
Hello,
I have been training the Alpaca LoRA model on multiple GPUs on my system, but I encountered a SIGBUS (signal 7) error during training. The issue seems to be related to memory access, but I'm not sure what the exact root cause is.
Here is a brief overview of my setup:
Number of GPUs: 5
System memory: 64 GB
GPUs: Tesla K80 and GeForce GTX 1070
CUDA version: 11.2
Training uses the finetune.py script.
I modified WORLD_SIZE and --nproc_per_node to 5, and set CUDA_VISIBLE_DEVICES to 0,1,2,3,4 (a sketch of the launch invocation follows the error output below).
The error message I received is as follows:
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
==================================================
finetune.py FAILED
--------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
--------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-03-29_22:03:17
host : finetune-job-96q2q
rank : 4 (local_rank: 4)
exitcode : -7 (pid: 10)
error_file: <N/A>
traceback : Signal 7 (SIGBUS) received by PID 10
==================================================
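For reference, the launch invocation looked roughly like the sketch below (the master port, data path, and output directory here are placeholders, not the exact values from my run):
WORLD_SIZE=5 CUDA_VISIBLE_DEVICES=0,1,2,3,4 torchrun --nproc_per_node=5 --master_port=29500 finetune.py \
    --base_model 'decapoda-research/llama-7b-hf' \
    --data_path 'alpaca_data.json' \
    --output_dir './lora-alpaca'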
I also noticed a tokenizer class mismatch warning in the logs, though I am not sure whether it is related to the SIGBUS error:
The tokenizer class you load from this checkpoint is 'LLaMATokenizer'.
The class this function is called from is 'LlamaTokenizer'.
I have tried reseating the memory in the GPU rig and verifying system memory usage, but the issue persists. I would appreciate any guidance or suggestions to resolve this issue.
Full log of last run: https://gist.github.com/sfxworks/e44f68ab456c10e6acf7dcf3caafe6e3
Thank you!
You can ignore the tokenizer warning.
Since you have mixed GPUs, I'd begin by trying with only one kind of GPU at a time and see what that gives.
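For example, something along these lines, first the lone GTX 1070, then the K80s on their own (the device indices here are an assumption; adjust them to whatever nvidia-smi reports):
CUDA_VISIBLE_DEVICES=0 python finetune.py --base_model 'decapoda-research/llama-7b-hf'
WORLD_SIZE=4 CUDA_VISIBLE_DEVICES=1,2,3,4 torchrun --nproc_per_node=4 finetune.py --base_model 'decapoda-research/llama-7b-hf'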
I tried with one GPU and the same issue occurs. I am now trying to eliminate the Tesla K80 so I can use CUDA 12.1 + driver 530 and see if that helps at all.
Tried many different images...
Last one was the pytorch image.
Multi:
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 2 (pid: 6) of binary: /opt/conda/bin/python
Traceback (most recent call last):
File "/opt/conda/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==1.13.1', 'console_scripts', 'torchrun')())
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=================================================
finetune.py FAILED
-------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
-------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-03-30_10:51:51
host : finetune-job-dnpfz
rank : 2 (local_rank: 2)
exitcode : -7 (pid: 6)
error_file: <N/A>
traceback : Signal 7 (SIGBUS) received by PID 6
=================================================
Single:
Loading checkpoint shards: 0%| | 0/33 [00:00<?, ?it/s]Error invalid device function at line 508 in file /mmfs1/gscratch/zlab/timdettmers/git/bitsandbytes/csrc/ops.cu
Dockerfile:
#FROM harbor.home.sfxworks.net/docker/smellslikeml/alpaca-lora
FROM harbor.home.sfxworks.net/docker/pytorch/pytorch:latest
RUN apt-get update -y && apt-get install -y git
WORKDIR /app
COPY . .
RUN pip install -r requirements.txt
ENV BASE_MODEL=decapoda-research/llama-7b-hf
# work around bitsandbytes falling back to its CPU library by overwriting it with the CUDA 11.7 build
RUN cp /opt/conda/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so /opt/conda/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so
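As a side note on the single-GPU failure above: "Error invalid device function" from bitsandbytes usually means its CUDA kernels were built for a compute capability the active GPU does not have; the Tesla K80 is compute capability 3.7, which the prebuilt bitsandbytes binaries most likely do not target. A quick diagnostic (not part of the original setup) to see which devices and capabilities each container actually exposes:
python -c "import torch; print([(i, torch.cuda.get_device_name(i), torch.cuda.get_device_capability(i)) for i in range(torch.cuda.device_count())])"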
Trying the official image, I also get this issue. I wonder if it's memory-related: https://github.com/facebookresearch/llama/issues/55
Hey, have you solved the problem already? Please share it with us, thanks! I just ran into the exact same error: "Signal 7 (SIGBUS) received by PID..."
How many GPUs? What command line did you use? (Full params, please.)
torchrun --nproc_per_node=8 --master_port=29005 finetune.py \
    --base_model='/remote-home/alpaca-llama-lora-7B-13B-65B/llama-13b-hf' \
    --data_path 'data/data-10w.json' \
    --output_dir='output/lora-alpaca-13' \
    --batch_size 128 \
    --micro_batch_size 2 \
    --num_epochs 2
I have encountered the same problem. Can you help me?
Dunno if it helps, but I just hit a SIGBUS error with torchrun and it was related to small shared memory (/dev/shm). It might be a matter of not enough RAM available, or of a small shm in a docker container / k8s pod, as in my case.
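If it helps anyone hitting this with plain Docker rather than Kubernetes: the container default for /dev/shm is only 64 MB, and the usual fix is to raise it at container start (the image and command below are placeholders):
# check the current /dev/shm size from inside the container
df -h /dev/shm
# restart with a larger shared-memory segment
docker run --gpus all --shm-size=1g <your-training-image> <your-training-command>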
Just want to point out that this solved my issue running PyTorch multi-GPU training in a docker container, thanks @ofilip
Can confirm: adding a 1Gi emptyDir mounted at /dev/shm to my container solved the SIGBUS for multi-GPU training with pytorch-lightning. Ref: https://www.sobyte.net/post/2022-04/k8s-pod-shared-memory/
This solved my issue with PyTorch Lightning ddp_fork mode; I could not train with more than 2 GPUs. I added the following lines to the Kubernetes pod config:
volumes:
  - emptyDir:
      # RAM-backed volume that will serve as shared memory
      medium: Memory
      sizeLimit: 512Mi
    name: cache-volume
...
volumeMounts:
  # mount the RAM-backed volume over the container's /dev/shm
  - mountPath: /dev/shm
    name: cache-volume
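Once the pod restarts, you can verify the new size from inside the container (the pod name is a placeholder):
kubectl exec <pod-name> -- df -h /dev/shm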