alpaca-lora

ERROR:torch.distributed.elastic.multiprocessing.errors.ChildFailedError

Open zweny opened this issue 1 year ago • 0 comments

The error I get when trying to run alpaca-lora:

command: CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 finetune.py
batch config: batch_size = 256, micro_batch_size = 32
env: python=3.10, torch=2.0.0, CUDA=11.7, GPUs=4x3090
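For context, here is a rough sketch of how these batch settings typically combine with the 4 torchrun workers into per-GPU gradient accumulation under DDP. This is an illustration, not copied verbatim from finetune.py, so treat the variable names as assumptions:

```python
# Sketch (assumed convention, not verbatim finetune.py): derive per-GPU
# gradient accumulation from the global batch settings reported above.
import os

batch_size = 256        # global batch size from the batch config above
micro_batch_size = 32   # per-step batch size on each GPU

# torchrun exports WORLD_SIZE to every worker; 4 here because --nproc_per_node=4
world_size = int(os.environ.get("WORLD_SIZE", "1"))

gradient_accumulation_steps = batch_size // micro_batch_size
if world_size > 1:
    # split the accumulation across the data-parallel ranks
    gradient_accumulation_steps //= world_size

print(gradient_accumulation_steps)  # 256 // 32 // 4 = 2 steps per GPU
```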

Error info:

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 0
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: finetune.py FAILED
Failures:
[1]:
  time       : 2023-04-14_06:35:35
  host       : 32a68c302449
  rank       : 1 (local_rank: 1)
  exitcode   : -7 (pid: 3112)
  error_file : <N/A>
  traceback  : Signal 7 (SIGBUS) received by PID 3112
[2]: ....
[3]: ....
Root Cause (first observed failure):
[0]:
  time       : 2023-04-14_06:35:35
  host       : 32a68c302449
  rank       : 0 (local_rank: 0)
  exitcode   : -7 (pid: 3111)
  error_file : <N/A>
  traceback  : Signal 7 (SIGBUS) received by PID 3111

zweny · Apr 14 '23 06:04