PersFormer_3DLane

The training code crashes

Open asadnorouzi opened this issue 3 years ago • 1 comments

The training code crashes without an understandable error message:

proc_id: 5
world size: 6
local_rank: 5
proc_id: 3
world size: 6
local_rank: 3
proc_id: 1
world size: 6
local_rank: 1
proc_id: 4
world size: 6
local_rank: 4
proc_id: 2
world size: 6
local_rank: 2
proc_id: 0
world size: 6
local_rank: 0
Let's use 6 GPUs!
Loading Dataset ...
mean_cam_height 2.115622443531824, mean_cam_pitch 0.0
mean_cam_height 2.115622443531824, mean_cam_pitch 0.0
mean_cam_height 2.115622443531824, mean_cam_pitch 0.0
mean_cam_height 2.115622443531824, mean_cam_pitch 0.0
mean_cam_height 2.115622443531824, mean_cam_pitch 0.0
mean_cam_height 2.115622443531824, mean_cam_pitch 0.0
Killing subprocess 6040
Killing subprocess 6041
Killing subprocess 6042
Killing subprocess 6043
Killing subprocess 6044
Killing subprocess 6047
Traceback (most recent call last):
  File "/root/anaconda3/envs/persformer/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/anaconda3/envs/persformer/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/root/anaconda3/envs/persformer/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/root/anaconda3/envs/persformer/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/root/anaconda3/envs/persformer/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/root/anaconda3/envs/persformer/bin/python', '-u', 'main_persformer.py', '--local_rank=5', '--mod=FIRST_RUN', '--batch_size=24']' died with <Signals.SIGKILL: 9>.

The command I ran is: python -m torch.distributed.launch --nproc_per_node 6 main_persformer.py --mod=FIRST_RUN --batch_size=24
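
The launcher only reports died with <Signals.SIGKILL: 9> and no Python traceback from main_persformer.py, which usually means something outside Python (often the kernel OOM killer) terminated one of the workers during Loading Dataset .... One way to check is to watch free host RAM while the dataset loads; a minimal sketch, assuming psutil is installed and the thread is started near the top of the training script:

# Hypothetical helper (not part of PersFormer): log free host memory while
# the dataset is being loaded, to see whether the SIGKILL coincides with
# RAM running out.
import threading
import time

import psutil  # assumption: psutil is available in the environment


def log_memory(interval_s=5.0):
    """Print available host memory every interval_s seconds."""
    while True:
        mem = psutil.virtual_memory()
        print(f"available: {mem.available / 1e9:.1f} GB "
              f"({mem.percent:.0f}% used)", flush=True)
        time.sleep(interval_s)


# Start as a daemon thread before the expensive "Loading Dataset ..." step.
threading.Thread(target=log_memory, daemon=True).start()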

Any idea?

asadnorouzi · Jul 26 '22 21:07

It seems the data volume is contributing to this error: if you train on the 300 subset of the dataset, it does not crash. I still need to figure out the exact cause and will do so later. If anyone else has suggestions, please post them here.
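
One rough way to test the data-volume theory is to compare the size of the annotation tree that gets cached against the free RAM on the training machine. A throwaway sketch, assuming psutil is installed; the annotation path is a placeholder you would point at your full label set:

# Rough sanity check (path is a placeholder): compare the size of the
# annotation tree that gets cached with the free host RAM.
import os

import psutil  # assumption: psutil is installed


def tree_size_gb(root):
    """Total size in GB of all files under root."""
    total = 0
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total / 1e9


labels_gb = tree_size_gb("/path/to/openlane/labels")  # placeholder path
free_gb = psutil.virtual_memory().available / 1e9
print(f"annotations on disk: {labels_gb:.1f} GB, free RAM: {free_gb:.1f} GB")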

asadnorouzi · Jul 28 '22 18:07

I forgot to update this issue earlier! The error happens when your machine does not have enough memory for the data loader to cache the data. If this happens to you and you do not have access to a machine with more memory, you can have the data loader load the data in separate steps. It will eventually save everything in a temp folder (.cache), and once that step completes you are good to go. I am closing this issue.
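
For anyone who hits the same wall, the general shape of that workaround is to preprocess one sample at a time and write each result into a .cache folder, so later runs reload small files from disk instead of holding everything in RAM at once. Below is a generic sketch of the pattern, not the repo's actual loader; the class, method, and file names are made up for illustration:

# Generic illustration of the "cache preprocessed samples to disk" pattern,
# not PersFormer's actual data loader. Names and paths are made up.
import os
import pickle

from torch.utils.data import Dataset


class CachedLaneDataset(Dataset):
    def __init__(self, raw_samples, cache_dir=".cache"):
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)
        self.paths = []
        # Preprocess one sample at a time and write it to disk, so peak RAM
        # stays around one sample instead of the whole dataset.
        for idx, raw in enumerate(raw_samples):
            path = os.path.join(cache_dir, f"sample_{idx:06d}.pkl")
            if not os.path.exists(path):  # lets interrupted runs resume
                with open(path, "wb") as f:
                    pickle.dump(self._preprocess(raw), f)
            self.paths.append(path)

    def _preprocess(self, raw):
        # Placeholder for the expensive per-sample work (decoding labels,
        # projecting 3D lanes, building targets, etc.).
        return raw

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # Load a single cached sample lazily instead of keeping everything
        # in memory.
        with open(self.paths[idx], "rb") as f:
            return pickle.load(f)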

asadnorouzi · Aug 22 '22 20:08