Error during multi-GPU training

Open smartswordsman opened this issue 1 year ago • 3 comments

Hello, I ran into the following error during multi-GPU training and do not know how to fix it. I would appreciate a reply, thanks!

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 53738 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 53740 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 53741 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 53742 closing signal SIGHUP
Traceback (most recent call last):
  File "/home/huchangyou/anaconda3/envs/firefly/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/huchangyou/anaconda3/envs/firefly/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/huchangyou/anaconda3/envs/firefly/lib/python3.9/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/huchangyou/anaconda3/envs/firefly/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/huchangyou/anaconda3/envs/firefly/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/huchangyou/anaconda3/envs/firefly/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 237, in launch_agent
    result = agent.run()
  File "/home/huchangyou/anaconda3/envs/firefly/lib/python3.9/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/home/huchangyou/anaconda3/envs/firefly/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", line 709, in run
    result = self._invoke_run(role)
  File "/home/huchangyou/anaconda3/envs/firefly/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", line 850, in _invoke_run
    time.sleep(monitor_interval)
  File "/home/huchangyou/anaconda3/envs/firefly/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 62, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 53672 got signal: 1

smartswordsman, Jul 01 '23 01:07

I ran into a similar problem. Is there a known fix?

grantchenhuarong, Jul 31 '23 13:07

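From what I found online, running the job in the background with nohup is not fully reliable. If the terminal is closed abnormally instead of being exited cleanly, a hangup signal (SIGHUP, the "signal: 1" in the traceback above) is sent to the processes, and in the end all of them are terminated.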
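
For anyone unsure what "got signal: 1" refers to, here is a small sketch that maps a signal number to its name using Python's standard `signal` module. The helper name `describe_signal` is made up for illustration, and the numbers shown are the usual Linux values.

```python
import signal


def describe_signal(signum: int) -> str:
    """Return the symbolic name for a signal number on this platform."""
    # signal.Signals is the standard-library enum of signals known to the OS.
    return signal.Signals(signum).name


# On Linux, 1 is the hangup signal sent when the controlling terminal goes away,
# and 15 is the ordinary termination request.
print(describe_signal(1))
print(describe_signal(15))
```

On a Linux machine this should report SIGHUP for 1, which matches the "closing signal SIGHUP" warnings in the log.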

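There are two options. First, do not simply close the terminal window; leave the session with exit. Second, try these commands:

$ nohup bash train.sh > train.log 2>&1 &
$ disown

This way the command keeps running even if the connection drops.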
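
If you would rather start the job from Python, a roughly equivalent approach is sketched below. Note that this is a different technique from nohup plus disown: it puts the child in its own session (via `setsid`), so it has no controlling terminal and should not be hit by the hangup when the terminal closes. The function name `launch_detached` is hypothetical, and this is POSIX only; I have not tested it with torchrun specifically.

```python
import subprocess


def launch_detached(cmd, log_path):
    """Start cmd in a new session, with all output appended to log_path.

    A sketch, not a drop-in replacement for nohup: it relies on the child
    having no controlling terminal rather than on ignoring SIGHUP.
    """
    log_file = open(log_path, "ab")
    try:
        proc = subprocess.Popen(
            cmd,
            stdin=subprocess.DEVNULL,    # do not keep the terminal's stdin open
            stdout=log_file,             # same role as "> train.log"
            stderr=subprocess.STDOUT,    # same role as "2>&1"
            start_new_session=True,      # child calls setsid() before exec
        )
    finally:
        # The child holds its own copy of the descriptor, so the parent can close it.
        log_file.close()
    return proc


# Example call, using the script and log names from the commands above:
#   launch_detached(["bash", "train.sh"], "train.log")
```

The launcher returns right away with the `Popen` object, so you can print `proc.pid` and then log out with exit as usual.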

grantchenhuarong, Jul 31 '23 14:07

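Leaving the session with exit worked for me. Yesterday I overlooked this and just shut down my computer, so a whole night of training was wasted.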

Joe-Hall-Lee, Feb 10 '24 03:02