alpaca-lora
ERROR:torch.distributed.elastic.multiprocessing.errors.ChildFailedError
The error I get when trying to run alpaca-lora:
command: CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 finetune.py
batch config: batch_size = 256 / micro_batch_size = 32
env: python = 3.10 / torch = 2.0.0 / CUDA = 11.7 / GPUs = 4x RTX 3090
Error info:
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 0
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
finetune.py FAILED
Failures:
[1]:
  time       : 2023-04-14_06:35:35
  host       : 32a68c302449
  rank       : 1 (local_rank: 1)
  exitcode   : -7 (pid: 3112)
  error_file : <N/A>
  traceback  : Signal 7 (SIGBUS) received by PID 3112
[2]: ....
[3]: ....
Root Cause (first observed failure):
[0]:
  time       : 2023-04-14_06:35:35
  host       : 32a68c302449
  rank       : 0 (local_rank: 0)
  exitcode   : -7 (pid: 3111)
  error_file : <N/A>
  traceback  : Signal 7 (SIGBUS) received by PID 3111
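For context on the log above: exitcode -7 means the worker processes died from signal 7 (SIGBUS). One common (but here unconfirmed, so treat it as an assumption) cause of SIGBUS in multi-process torchrun jobs is an exhausted shared-memory mount, e.g. Docker's small default /dev/shm inside a container (the host name 32a68c302449 looks like a container ID). A quick check:

```shell
# Assumption: the job runs inside a Docker container whose /dev/shm may be
# too small for 4 worker processes sharing tensors. Inspect the mount size:
df -h /dev/shm

# If the size is small (Docker defaults to 64M), a larger shm can be
# requested when starting the container, e.g.:
#   docker run --shm-size=16g ...
```

This is only a diagnostic sketch; if /dev/shm is already large, the SIGBUS would need a different explanation (e.g. a truncated memory-mapped file).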