
[DOC]:

Open Johnson-yue opened this issue 3 years ago • 0 comments

📚 The doc issue

How can I run ColossalAI in the background?

I am connected to a remote GPU machine. I started training with nohup colossalai run --nproc_per_node 1 colossai_train.py --config configs/colossai/config.py 1>logs/colossalai_train.log &

It worked well, but when I closed the terminal, the program was terminated as well.

How can I keep it running in the background?

The log:

Epoch: 0, iteration: 84, loss: 0.335686: 0%| | 88/139138 [00:43<17:50:33, 2.16it/s]
WARNING:torch.distributed.elastic.agent.server.api:Received 1 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 269945 closing signal SIGHUP
Traceback (most recent call last):
  File "/home/chenzd/anaconda3/envs/py39yq/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/chenzd/anaconda3/envs/py39yq/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/chenzd/anaconda3/envs/py39yq/lib/python3.9/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/home/chenzd/anaconda3/envs/py39yq/lib/python3.9/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/chenzd/anaconda3/envs/py39yq/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/chenzd/anaconda3/envs/py39yq/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 236, in launch_agent
    result = agent.run()
  File "/home/chenzd/anaconda3/envs/py39yq/lib/python3.9/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
    result = f(*args, **kwargs)
  File "/home/chenzd/anaconda3/envs/py39yq/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", line 709, in run
    result = self._invoke_run(role)
  File "/home/chenzd/anaconda3/envs/py39yq/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", line 850, in _invoke_run
    time.sleep(monitor_interval)
  File "/home/chenzd/anaconda3/envs/py39yq/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 60, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 269904 got signal: 1
Error: failed to run torchrun --nproc_per_node=1 --nnodes=1 --node_rank=0 --rdzv_backend=c10d --rdzv_endpoint=127.0.0.1:29500 --rdzv_id=colossalai-default-job colossai_train_hand64.py --config configs/colossai/hand64_config.py on 127.0.0.1
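
The traceback ends in a SignalException for signal 1 (SIGHUP), raised from a handler inside torch.distributed.elastic. That suggests the launcher handles SIGHUP itself, so the "ignore" setting that nohup applies may not survive. One possible workaround, offered only as a sketch and not as ColossalAI's recommended method, is to start the command in its own session so that closing the terminal does not deliver SIGHUP to it at all. The helper name launch_detached below is hypothetical; the command and log path are the ones from the question.

```python
import subprocess
from pathlib import Path


def launch_detached(cmd, log_path):
    """Start cmd in a new session, with stdout and stderr appended to log_path.

    Assumption: a POSIX system. start_new_session=True makes the child call
    setsid(), so it no longer belongs to the terminal's session and should not
    receive the SIGHUP sent when that terminal closes.
    """
    log_file = Path(log_path)
    log_file.parent.mkdir(parents=True, exist_ok=True)
    with open(log_file, "ab") as log:
        proc = subprocess.Popen(
            cmd,
            stdin=subprocess.DEVNULL,   # do not keep the terminal as stdin
            stdout=log,
            stderr=subprocess.STDOUT,   # send errors to the same log file
            start_new_session=True,
        )
    # The child holds its own copy of the log file descriptor, so closing
    # the parent's handle here does not interrupt logging.
    return proc


# Command taken from the question; adjust the paths to your setup.
COLOSSALAI_CMD = [
    "colossalai", "run", "--nproc_per_node", "1",
    "colossai_train.py", "--config", "configs/colossai/config.py",
]

# Example call (commented out so that importing this file starts nothing):
# proc = launch_detached(COLOSSALAI_CMD, "logs/colossalai_train.log")
# print(f"started PID {proc.pid}")
```

Running the original command inside a tmux or screen session, or prefixing it with setsid, should have a similar effect from the shell, since in each case the job is no longer tied to the terminal being closed.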

Johnson-yue · Sep 23 '22 03:09