Chinese-LLaMA-Alpaca

LLaMA-13B training: runs fine on 4 GPUs, fails when training with 8 GPUs

Open · ccdf1137 opened this issue on May 18, 2023 · 4 comments

The training script was not modified. Error output:

```
05/18/2023 14:22:59 - WARNING - datasets.arrow_dataset - Loading cached split indices for dataset at /workspace/fumengen/works/Chinese-LLaMA-Alpaca/data/cache/cache_7b/test_245/train/cache-37f5bb9ea7f0a4d6.arrow and /workspace/fumengen/works/Chinese-LLaMA-Alpaca/data/cache/cache_7b/test_245/train/cache-3e74c0449552ea77.arrow
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3567 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3570 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 0 (pid: 3565) of binary: /workspace/fumengen/vir_fme/bin/python3.10
Traceback (most recent call last):
  File "/workspace/fumengen/vir_fme/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/workspace/fumengen/vir_fme/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/workspace/fumengen/vir_fme/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/workspace/fumengen/vir_fme/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/workspace/fumengen/vir_fme/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/workspace/fumengen/vir_fme/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

run_clm_pt_with_peft.py FAILED

Failures:
[1]:
  time       : 2023-05-18_14:23:05
  host       : 7b4985eaff35
  rank       : 1 (local_rank: 1)
  exitcode   : -7 (pid: 3566)
  error_file : <N/A>
  traceback  : Signal 7 (SIGBUS) received by PID 3566
[2]:
  time       : 2023-05-18_14:23:05
  host       : 7b4985eaff35
  rank       : 3 (local_rank: 3)
  exitcode   : -7 (pid: 3568)
  error_file : <N/A>
  traceback  : Signal 7 (SIGBUS) received by PID 3568
[3]:
  time       : 2023-05-18_14:23:05
  host       : 7b4985eaff35
  rank       : 4 (local_rank: 4)
  exitcode   : -7 (pid: 3569)
  error_file : <N/A>
  traceback  : Signal 7 (SIGBUS) received by PID 3569

Root Cause (first observed failure):
[0]:
  time       : 2023-05-18_14:23:05
  host       : 7b4985eaff35
  rank       : 0 (local_rank: 0)
  exitcode   : -7 (pid: 3565)
  error_file : <N/A>
  traceback  : Signal 7 (SIGBUS) received by PID 3565
```

ccdf1137 · May 18 '23 07:05

It crashed at the dataset loading step. Could it be insufficient memory?

airaria · May 18 '23 08:05

1 TB of RAM, that's definitely enough.

ccdf1137 · May 19 '23 01:05
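One way to check whether the training processes actually see that 1 TB (and enough shared memory, which PyTorch dataloader workers and NCCL allocate from /dev/shm and which Docker caps at 64 MB by default) is to compare the host view with the container view. A minimal sketch, with a hypothetical container name:

```bash
# On the host: total and available RAM
free -h

# Inside the container (container name is hypothetical)
docker exec -it chinese-llama-train free -h

# Shared memory visible to the container; Docker's default is only 64 MB
# unless the container was started with --shm-size or --ipc=host
docker exec -it chinese-llama-train df -h /dev/shm
```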

@airaria After troubleshooting, I found the cause is running inside a Docker container. Training runs fine when I create a virtual environment directly on the server, but it fails as soon as I train inside the Docker container. Have you ever run into this problem, or did I set up Docker incorrectly?

ccdf1137 · May 19 '23 10:05

> @airaria After troubleshooting, I found the cause is running inside a Docker container. Training runs fine when I create a virtual environment directly on the server, but it fails as soon as I train inside the Docker container. Have you ever run into this problem, or did I set up Docker incorrectly?

We haven't encountered this problem; it's probably still an environment issue.

airaria · May 19 '23 12:05

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your consideration.

github-actions[bot] · May 26 '23 22:05

Closing the issue since no further updates have been observed. Feel free to re-open if you need any further assistance.

github-actions[bot] · May 29 '23 22:05

> 1 TB of RAM, that's definitely enough.

How do you train with multiple GPUs on a single machine?

zzx528 · Jun 18 '23 04:06
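On zzx528's question: the script in this thread, run_clm_pt_with_peft.py, is launched through torchrun, so single-machine multi-GPU training comes down to setting --nproc_per_node to the number of GPUs. A minimal sketch; the script argument names are assumed to match the repo's example launch script and should be verified against your local copy, and all paths and values below are placeholders:

```bash
# Single machine, 8 GPUs: torchrun starts one worker process per GPU.
# Paths, port, and hyperparameters are placeholders; argument names are
# assumed from the repo's example script and should be double-checked.
torchrun --nnodes 1 --nproc_per_node 8 --master_port 29500 \
    run_clm_pt_with_peft.py \
    --model_name_or_path /path/to/llama-13b-hf \
    --tokenizer_name_or_path /path/to/chinese-llama-tokenizer \
    --dataset_dir /path/to/pt_data \
    --data_cache_dir /path/to/data_cache \
    --per_device_train_batch_size 1 \
    --deepspeed ds_zero2_no_offload.json \
    --output_dir /path/to/output
```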

> @airaria After troubleshooting, I found the cause is running inside a Docker container. Training runs fine when I create a virtual environment directly on the server, but it fails as soon as I train inside the Docker container. Have you ever run into this problem, or did I set up Docker incorrectly?

I ran into the same problem: 4 GPUs on a single machine worked fine, but both 6 and 8 GPUs failed. It is related to how you create the Docker container: when building the container, drop the --network=host flag and expose only a few ports with -p, and single-machine 8-GPU training works. My guess is that the cause is port conflicts between the container and the host.

louiss007 · Sep 13 '23 13:09
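A sketch of the container launch louiss007 describes: publish only the ports you actually need with -p instead of sharing the host network stack via --network=host. The image name, mount, and port number are placeholders, and --shm-size is an extra precaution (not part of louiss007's suggestion) related to the /dev/shm note earlier in the thread:

```bash
# Instead of:  docker run --gpus all --network=host ... <image>
# publish only the ports training actually needs.
#   29500: default torchrun/c10d rendezvous port (placeholder choice)
# --shm-size enlarges /dev/shm for dataloader workers and NCCL (optional,
# not part of the fix louiss007 describes).
docker run --gpus all \
    -p 29500:29500 \
    --shm-size=64g \
    -v /workspace:/workspace \
    -it <image> /bin/bash
```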