
SIGBUS Error During Training with Multiple GPUs

Open · sfxworks opened this issue on Mar 29, 2023 · 11 comments

Hello,

I have been running the Alpaca LoRA model training using multiple GPUs on my system, but I encountered a SIGBUS (Signal 7) error during training. The issue seems to be related to memory access, but I'm not sure what the exact root cause is.

Here is a brief overview of my setup:

Number of GPUs: 5
System memory: 64 GB
GPUs: Tesla K80 and GeForce GTX 1070
CUDA version: 11.2
Training using the finetune.py script
Modified WORLD_SIZE and --nproc_per_node to 5, and CUDA_VISIBLE_DEVICES to 0,1,2,3,4

(A rough sketch of the launch command follows the error log below.) The error message I received is as follows:

torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
==================================================
finetune.py FAILED
--------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
--------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-03-29_22:03:17
  host      : finetune-job-96q2q
  rank      : 4 (local_rank: 4)
  exitcode  : -7 (pid: 10)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 10
==================================================
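
For reference, the launch looked roughly like this (a sketch; the master port, data path and output dir shown here are placeholders rather than the exact values I used):

WORLD_SIZE=5 CUDA_VISIBLE_DEVICES=0,1,2,3,4 torchrun --nproc_per_node=5 --master_port=1234 finetune.py \
    --base_model 'decapoda-research/llama-7b-hf' \
    --data_path 'yahma/alpaca-cleaned' \
    --output_dir './lora-alpaca'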

I also noticed a tokenizer class mismatch warning in the logs; I am not sure whether it is related to the SIGBUS error:

The tokenizer class you load from this checkpoint is 'LLaMATokenizer'. 
The class this function is called from is 'LlamaTokenizer'.

I have tried reseating the memory in the GPU rig and verifying system memory usage, but the issue persists. I would appreciate any guidance or suggestions to resolve this issue.

Full log of last run: https://gist.github.com/sfxworks/e44f68ab456c10e6acf7dcf3caafe6e3

Thank you!

sfxworks avatar Mar 29 '23 22:03 sfxworks

You can ignore the tokenizer warning.

Since you have mixed GPUs, I'd begin by trying one kind of GPU at a time and see what that gives. See the sketch below.
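
For example (a sketch, assuming the K80 dies show up as devices 0-3 and the GTX 1070 as device 4; check nvidia-smi for the actual ordering), restrict a run to just the 1070:

CUDA_VISIBLE_DEVICES=4 WORLD_SIZE=1 torchrun --nproc_per_node=1 finetune.py --base_model 'decapoda-research/llama-7b-hf'

and then to just the K80:

CUDA_VISIBLE_DEVICES=0,1,2,3 WORLD_SIZE=4 torchrun --nproc_per_node=4 finetune.py --base_model 'decapoda-research/llama-7b-hf'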

AngainorDev avatar Mar 30 '23 06:03 AngainorDev

I tried with one GPU and the same issue occurs. I am now trying to eliminate the Tesla K80 so I can use CUDA 12.1 + driver 530 and see if that helps at all.
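
A quick way to confirm which devices the training process actually sees, and their compute capabilities (the K80 is Kepler, compute capability 3.7, which CUDA 12 and the 530 driver no longer support, so it would have to go for that combination anyway):

python -c "import torch; print([(torch.cuda.get_device_name(i), torch.cuda.get_device_capability(i)) for i in range(torch.cuda.device_count())])"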

sfxworks avatar Mar 30 '23 07:03 sfxworks

Tried many different images...

Last one was the pytorch image.

Multi:

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 2 (pid: 6) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.13.1', 'console_scripts', 'torchrun')())
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=================================================
finetune.py FAILED
-------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-03-30_10:51:51
  host      : finetune-job-dnpfz
  rank      : 2 (local_rank: 2)
  exitcode  : -7 (pid: 6)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 6
=================================================

Single:

Loading checkpoint shards:   0%|          | 0/33 [00:00<?, ?it/s]Error invalid device function at line 508 in file /mmfs1/gscratch/zlab/timdettmers/git/bitsandbytes/csrc/ops.cu

Dockerfile:

#FROM harbor.home.sfxworks.net/docker/smellslikeml/alpaca-lora
FROM harbor.home.sfxworks.net/docker/pytorch/pytorch:latest
RUN apt-get update -y && apt-get install -y git
WORKDIR /app
COPY . .
RUN pip install -r requirements.txt
ENV BASE_MODEL=decapoda-research/llama-7b-hf
RUN cp /opt/conda/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so /opt/conda/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so
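
The single-GPU "invalid device function" from bitsandbytes' ops.cu looks like an unsupported-architecture problem rather than the SIGBUS itself: as far as I know, the prebuilt 8-bit kernels in libbitsandbytes_cuda117.so target Maxwell-or-newer GPUs, so they cannot run on a Kepler K80 (compute capability 3.7), and copying that library over libbitsandbytes_cpu.so as in the Dockerfile above will not change that. bitsandbytes prints a CUDA setup banner on import, which should show which binary it actually loaded:

python -c "import bitsandbytes"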

sfxworks avatar Mar 30 '23 10:03 sfxworks

Trying the official image, I also get this issue. I wonder if it's memory related: https://github.com/facebookresearch/llama/issues/55
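
If memory is the suspect, it may be worth checking both RAM and the container's shared memory segment from inside the pod right before the run (a too-small /dev/shm is a classic cause of SIGBUS in PyTorch dataloader workers):

free -h
df -h /dev/shm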

sfxworks avatar Apr 02 '23 07:04 sfxworks

(quoting sfxworks's comment above: "Tried many different images... Last one was the pytorch image. ...")
Hey, have you solved the problem already? Please share it with us, thanks! I just hit the exact same error, "Signal 7 (SIGBUS) received by PID..."

jeinlee1991 avatar Apr 13 '23 03:04 jeinlee1991

How many gpus? What command line did you use? (full params please)

AngainorDev avatar Apr 13 '23 07:04 AngainorDev

torchrun --nproc_per_node=8 --master_port=29005 finetune.py \
    --base_model='/remote-home/alpaca-llama-lora-7B-13B-65B/llama-13b-hf' \
    --data_path 'data/data-10w.json' \
    --output_dir='output/lora-alpaca-13' \
    --batch_size 128 \
    --micro_batch_size 2 \
    --num_epochs 2
I have encountered the same problem. Can you help me?

22zhangqian avatar Jun 18 '23 13:06 22zhangqian

Dunno if it helps, but I just hit SIGBUS error with torchrun and it was related to small shared memory (/dev/shm). It might be about not enough RAM available or small shm in docker container/k8s pod as in my case.
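
For a plain Docker run, /dev/shm defaults to only 64 MB; either of the following (a sketch, the image name and size are illustrative) gives the dataloader workers more room:

# raise /dev/shm explicitly...
docker run --gpus all --shm-size=8g my-training-image torchrun --nproc_per_node=2 finetune.py
# ...or share the host's IPC namespace, as the official PyTorch images themselves suggest
docker run --gpus all --ipc=host my-training-image torchrun --nproc_per_node=2 finetune.py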

ofilip avatar Jun 21 '23 10:06 ofilip

Dunno if it helps, but I just hit SIGBUS error with torchrun and it was related to small shared memory (/dev/shm). It might be about not enough RAM available or small shm in docker container/k8s pod as in my case.

Just want to point out that this solved my issue running PyTorch multi-GPU training in a Docker container, thanks @ofilip

charles-lugagne avatar Aug 17 '23 11:08 charles-lugagne

Can confirm: adding a 1Gi emptyDir mounted at /dev/shm to my container solved the SIGBUS for multi-GPU training with pytorch-lightning. Ref: https://www.sobyte.net/post/2022-04/k8s-pod-shared-memory/

ddelange avatar Sep 01 '23 11:09 ddelange

Dunno if it helps, but I just hit SIGBUS error with torchrun and it was related to small shared memory (/dev/shm). It might be about not enough RAM available or small shm in docker container/k8s pod as in my case.

This solved my issue with PyTorch Lightning's ddp_fork mode; I could not train with more than 2 GPUs. I added the following lines to the Kubernetes pod config:

volumes:
  - emptyDir:
      medium: Memory
      sizeLimit: 512Mi
    name: cache-volume
...
volumeMounts:
  - mountPath: /dev/shm
    name: cache-volume
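
Without that emptyDir, the pod typically gets only 64Mi at /dev/shm, which the extra worker processes can exhaust quickly. Once the pod restarts, you can verify the mount took effect (<pod-name> is a placeholder):

kubectl exec -it <pod-name> -- df -h /dev/shm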

Stanvla avatar Sep 27 '23 17:09 Stanvla