Chinese-LLaMA-Alpaca
LLaMA 13B training: runs normally on 4 GPUs, fails with an error on 8 GPUs
The training script was not modified. Error output:
`05/18/2023 14:22:59 - WARNING - datasets.arrow_dataset - Loading cached split indices for dataset at /workspace/fumengen/works/Chinese-LLaMA-Alpaca/data/cache/cache_7b/test_245/train/cache-37f5bb9ea7f0a4d6.arrow and /workspace/fumengen/works/Chinese-LLaMA-Alpaca/data/cache/cache_7b/test_245/train/cache-3e74c0449552ea77.arrow
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3567 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3570 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 0 (pid: 3565) of binary: /workspace/fumengen/vir_fme/bin/python3.10
Traceback (most recent call last):
  File "/workspace/fumengen/vir_fme/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/workspace/fumengen/vir_fme/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/workspace/fumengen/vir_fme/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/workspace/fumengen/vir_fme/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/workspace/fumengen/vir_fme/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/workspace/fumengen/vir_fme/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
run_clm_pt_with_peft.py FAILED
Failures:
[1]:
  time      : 2023-05-18_14:23:05
  host      : 7b4985eaff35
  rank      : 1 (local_rank: 1)
  exitcode  : -7 (pid: 3566)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 3566
[2]:
  time      : 2023-05-18_14:23:05
  host      : 7b4985eaff35
  rank      : 3 (local_rank: 3)
  exitcode  : -7 (pid: 3568)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 3568
[3]:
  time      : 2023-05-18_14:23:05
  host      : 7b4985eaff35
  rank      : 4 (local_rank: 4)
  exitcode  : -7 (pid: 3569)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 3569
Root Cause (first observed failure):
[0]:
  time      : 2023-05-18_14:23:05
  host      : 7b4985eaff35
  rank      : 0 (local_rank: 0)
  exitcode  : -7 (pid: 3565)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 3565
`
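For anyone reading the log above: the `exitcode : -7` entries follow the usual Python convention that a negative exit code is the negated number of the signal that killed the process, which matches the `Signal 7 (SIGBUS)` text in the same entries. Below is a minimal sketch of that mapping using only the standard `signal` module. The helper name `describe_exitcode` is made up for illustration, and signal numbers are platform-dependent, so the name printed for 7 depends on the OS you run it on.

```python
import signal


def describe_exitcode(exitcode: int) -> str:
    """Describe a worker exit code as Python process APIs report it."""
    if exitcode < 0:
        # A negative value means "terminated by signal number -exitcode".
        signum = -exitcode
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # The number is not a known signal on this platform.
            name = f"signal {signum}"
        return f"killed by {name}"
    if exitcode == 0:
        return "exited normally"
    return f"exited with status {exitcode}"


if __name__ == "__main__":
    # The exit code reported for every failed rank in the log above.
    print(describe_exitcode(-7))
```

This only names the signal; it does not say why the workers received it.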
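It crashed while loading the dataset. Could it be running out of memory?
The machine has 1 TB of RAM, so that is definitely enough.
@airaria After investigating, I found that the Docker container is the cause. Training runs normally when I create a virtual environment directly on the server, but it fails as soon as I train inside the Docker container. Have you run into this problem, or is my Docker installation broken?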
We have not run into this problem. It is probably still an environment issue.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your consideration.
Closing the issue, since no updates observed. Feel free to re-open if you need any further assistance.
How do I train with multiple GPUs on a single machine?
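I ran into the same problem: on a single machine, 4 GPUs work fine, but 6 GPUs and 8 GPUs both fail. It is related to how you create the Docker container. Remove the --network=host argument when creating the container and publish only a few ports with -p; after that, single-machine 8-GPU training works. My guess is that the cause is a port conflict between the container and the host.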
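If you want to test the port-conflict guess before rebuilding the container, one rough check is whether the port your launcher wants is already taken in the environment where training runs. This is a minimal sketch using only the standard `socket` module. The helper name `is_port_free` is made up for illustration, and 29500 is used on the assumption that your launch does not override torchrun's default master port; substitute whatever port your setup actually uses.

```python
import socket


def is_port_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can currently be bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "address already in use": something else holds the port.
            return False
        return True


if __name__ == "__main__":
    # Assumed port: 29500, torchrun's default --master_port.
    port = 29500
    state = "free" if is_port_free(port) else "already in use"
    print(f"port {port} is {state}")
```

Run it inside the container. With --network=host the container shares the host's network stack, so a port held by any host process shows up as in use here. A free port does not rule out other causes; it only tells you this particular port is not the conflict.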