The training code crashes
The training code crashes without a clear error message:
proc_id: 5
world size: 6
local_rank: 5
proc_id: 3
world size: 6
local_rank: 3
proc_id: 1
world size: 6
local_rank: 1
proc_id: 4
world size: 6
local_rank: 4
proc_id: 2
world size: 6
local_rank: 2
proc_id: 0
world size: 6
local_rank: 0
Let's use 6 GPUs!
Loading Dataset ...
mean_cam_height 2.115622443531824, mean_cam_pitch 0.0
mean_cam_height 2.115622443531824, mean_cam_pitch 0.0
mean_cam_height 2.115622443531824, mean_cam_pitch 0.0
mean_cam_height 2.115622443531824, mean_cam_pitch 0.0
mean_cam_height 2.115622443531824, mean_cam_pitch 0.0
mean_cam_height 2.115622443531824, mean_cam_pitch 0.0
Killing subprocess 6040
Killing subprocess 6041
Killing subprocess 6042
Killing subprocess 6043
Killing subprocess 6044
Killing subprocess 6047
Traceback (most recent call last):
File "/root/anaconda3/envs/persformer/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/root/anaconda3/envs/persformer/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/root/anaconda3/envs/persformer/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in <module>
main()
File "/root/anaconda3/envs/persformer/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main
sigkill_handler(signal.SIGTERM, None) # not coming back
File "/root/anaconda3/envs/persformer/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/root/anaconda3/envs/persformer/bin/python', '-u', 'main_persformer.py', '--local_rank=5', '--mod=FIRST_RUN', '--batch_size=24']' died with <Signals.SIGKILL: 9>.
The command I executed is:
python -m torch.distributed.launch --nproc_per_node 6 main_persformer.py --mod=FIRST_RUN --batch_size=24
Any ideas?
It seems that the data volume is contributing to this error: if you train on the 300 subset of the dataset, it does not crash. I still need to figure out the exact cause and will do so at a later time. If anyone else has suggestions, please post them here.
I forgot to update this issue earlier! The error occurs when your machine does not have enough memory for the data loader to cache the data; a worker dying with SIGKILL: 9 and no Python traceback of its own is typical of the Linux OOM killer. If this happens to you and you do not have access to a machine with more memory, you can have the data loader load the data in separate steps. It will eventually save the cached data in a temp folder (.cache), and once that step completes you are good to go. I am closing this issue.
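For anyone hitting the same wall, here is a minimal sketch of this kind of staged, disk-backed caching. The class name DiskCachedDataset, the chunking scheme, and the pickle format are all illustrative assumptions, not PersFormer's actual loader:

```python
import os
import pickle

from torch.utils.data import Dataset


class DiskCachedDataset(Dataset):
    """Preprocess samples chunk by chunk and cache each chunk to disk,
    so the full dataset never has to sit in RAM at once."""

    def __init__(self, raw_samples, preprocess, cache_dir=".cache", chunk_size=512):
        self.cache_dir = cache_dir
        self.chunk_size = chunk_size
        self.num_samples = len(raw_samples)
        os.makedirs(cache_dir, exist_ok=True)
        # Step 1: build the cache one chunk at a time. Chunks that already
        # exist on disk are skipped, so a run killed partway through resumes
        # where it left off instead of starting over.
        for start in range(0, self.num_samples, chunk_size):
            path = self._chunk_path(start // chunk_size)
            if os.path.exists(path):
                continue
            chunk = [preprocess(s) for s in raw_samples[start:start + chunk_size]]
            with open(path, "wb") as f:
                pickle.dump(chunk, f)  # only one chunk held in memory at a time

    def _chunk_path(self, chunk_idx):
        return os.path.join(self.cache_dir, f"chunk_{chunk_idx:05d}.pkl")

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        # Step 2: at training time, load only the chunk containing this sample.
        with open(self._chunk_path(idx // self.chunk_size), "rb") as f:
            chunk = pickle.load(f)
        return chunk[idx % self.chunk_size]
```

Loading one chunk per __getitem__ keeps peak memory flat at the cost of extra disk I/O; in practice you might keep the most recently loaded chunk around as a small cache, but the point is just that the expensive preprocessing happens once and lands in .cache.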