
semaphore_tracker: There appear to be 4 leaked semaphores to clean up at shutdown len(cache))

Open yeyang1021 opened this issue 3 years ago • 7 comments

How should I deal with this problem? It happens when I use "bash scripts/dist_train.sh ${NUM_GPUS} --cfg_file cfgs/waymo_models/centerpoint_4frames.yaml" to train on the Waymo dataset.

It looks like this: multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 4 leaked semaphores to clean up at shutdown  len(cache))

yeyang1021 avatar Sep 19 '22 01:09 yeyang1021

2022-09-19 10:25:52,268   INFO  GT database has been saved to shared memory
2022-09-19 10:25:52,743   INFO  Loading Waymo dataset
2022-09-19 10:29:48,426   INFO  Total skipped info 0
2022-09-19 10:29:48,426   INFO  Total samples for Waymo dataset: 158081
2022-09-19 10:29:48,435   INFO  Total sampled samples for Waymo dataset: 31617
2022-09-19 10:29:48,438   INFO  Loading training data to shared memory (file limit=35000)
/home2/yeyang/anaconda3/envs/spconv/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 3 leaked semaphores to clean up at shutdown  len(cache))
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 0 (pid: 12255) of binary: /home2/yeyang/anaconda3/envs/spconv/bin/python
Traceback (most recent call last):
  File "/home2/yeyang/anaconda3/envs/spconv/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home2/yeyang/anaconda3/envs/spconv/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home2/yeyang/anaconda3/envs/spconv/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home2/yeyang/anaconda3/envs/spconv/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home2/yeyang/anaconda3/envs/spconv/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home2/yeyang/anaconda3/envs/spconv/lib/python3.7/site-packages/torch/distributed/run.py", line 713, in run
    )(*cmd_args)
  File "/home2/yeyang/anaconda3/envs/spconv/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home2/yeyang/anaconda3/envs/spconv/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=====================================================
train.py FAILED
-----------------------------------------------------
Failures:
[1]:
  time      : 2022-09-19_10:36:21
  host      : yeyang
  rank      : 1 (local_rank: 1)
  exitcode  : -7 (pid: 12256)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 12256
[2]:

yeyang1021 avatar Sep 19 '22 02:09 yeyang1021

You can reduce num_of_worker to avoid this problem.

Cedarch avatar Sep 19 '22 07:09 Cedarch
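A concrete way to do that, as a sketch: this assumes your OpenPCDet checkout's tools/train.py exposes a --workers argument for the dataloader worker count and that scripts/dist_train.sh forwards extra arguments to train.py.

```bash
# Assumed flags: --workers is forwarded by dist_train.sh to tools/train.py.
# Lowering it (e.g. to 1 or 2) reduces host RAM used by dataloader worker processes.
bash scripts/dist_train.sh ${NUM_GPUS} \
    --cfg_file cfgs/waymo_models/centerpoint_4frames.yaml \
    --workers 1
```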

> You can reduce num_of_worker to avoid this problem.

Thank you! There is another problem:

[Exception|implicit_gemm_pair]indices=torch.Size([720000, 4]),bs=4,ss=[41, 1504, 1504],algo=ConvAlgo.MaskImplicitGemm,ksize=[3, 3, 3],stride=[1, 1, 1],padding=[1, 1, 1],dilation=[1, 1, 1],subm=True,transpose=False
SPCONV_DEBUG_SAVE_PATH not found, you can specify SPCONV_DEBUG_SAVE_PATH as debug data save path to save debug data which can be attached in a issue.

yeyang1021 avatar Sep 20 '22 01:09 yeyang1021
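As the spconv message above suggests, SPCONV_DEBUG_SAVE_PATH can be pointed at a writable directory so that the failing layer dumps debug data; a minimal sketch (the directory path here is arbitrary):

```bash
# Set the debug dump directory before launching training, as suggested by the spconv error.
export SPCONV_DEBUG_SAVE_PATH=/tmp/spconv_debug   # any writable path
mkdir -p "$SPCONV_DEBUG_SAVE_PATH"
bash scripts/dist_train.sh ${NUM_GPUS} --cfg_file cfgs/waymo_models/centerpoint_4frames.yaml
```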

> You can reduce num_of_worker to avoid this problem.

Hello, can you explain why this error is related to num_of_worker?

jlqzzz avatar Sep 20 '22 02:09 jlqzzz

> You can reduce num_of_worker to avoid this problem.
>
> Thank you! There is another problem:
>
> [Exception|implicit_gemm_pair]indices=torch.Size([720000, 4]),bs=4,ss=[41, 1504, 1504],algo=ConvAlgo.MaskImplicitGemm,ksize=[3, 3, 3],stride=[1, 1, 1],padding=[1, 1, 1],dilation=[1, 1, 1],subm=True,transpose=False
> SPCONV_DEBUG_SAVE_PATH not found, you can specify SPCONV_DEBUG_SAVE_PATH as debug data save path to save debug data which can be attached in a issue.

This error may be caused by an incorrect input shape for spconv; you can check the input shape of the spconv layer (note: the input channel count of the first spconv layer should be 6 for multi-frame).

Cedarch avatar Sep 20 '22 09:09 Cedarch
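One way to run that check is to print the channel counts of the first sparse conv layer after the network is built. The sketch below is illustrative rather than OpenPCDet's exact code; it assumes spconv v2 (spconv.pytorch) and a model object such as the one returned by OpenPCDet's build_network().

```python
# Illustrative check (assumed names): print the first sparse conv layer's channels.
# For the 4-frame CenterPoint config the first layer is expected to take 6 input
# channels; a different number points at a dataset/feature-config mismatch.
import spconv.pytorch as spconv

def print_first_sparse_conv(model):
    for name, module in model.named_modules():
        if isinstance(module, (spconv.SubMConv3d, spconv.SparseConv3d)):
            print(f"{name}: in_channels={module.in_channels}, out_channels={module.out_channels}")
            return
    print("no spconv layers found")

# print_first_sparse_conv(model)  # `model` = network returned by build_network()
```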

> You can reduce num_of_worker to avoid this problem.

> Hello, can you explain why this error is related to num_of_worker?

Hi, this error is caused by CPU OOM; too many workers consume a large amount of memory in the multi-frame setting.

Cedarch avatar Sep 20 '22 09:09 Cedarch

> You can reduce num_of_worker to avoid this problem.
>
> Thank you! There is another problem:
>
> [Exception|implicit_gemm_pair]indices=torch.Size([720000, 4]),bs=4,ss=[41, 1504, 1504],algo=ConvAlgo.MaskImplicitGemm,ksize=[3, 3, 3],stride=[1, 1, 1],padding=[1, 1, 1],dilation=[1, 1, 1],subm=True,transpose=False
> SPCONV_DEBUG_SAVE_PATH not found, you can specify SPCONV_DEBUG_SAVE_PATH as debug data save path to save debug data which can be attached in a issue.

> This error may be caused by an incorrect input shape for spconv; you can check the input shape of the spconv layer (note: the input channel count of the first spconv layer should be 6 for multi-frame).

Thank you! I found that this error comes from multi-GPU training. It works when I set NUM_GPUS=1 in "bash scripts/dist_train.sh ${NUM_GPUS} --cfg_file cfgs/waymo_models/centerpoint_4frames.yaml".

yeyang1021 avatar Sep 21 '22 01:09 yeyang1021

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot] avatar Oct 21 '22 02:10 github-actions[bot]

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] avatar Nov 04 '22 02:11 github-actions[bot]