OpenPCDet
semaphore_tracker: There appear to be 4 leaked semaphores to clean up at shutdown len(cache))
How can I deal with this problem? It happens when I use "bash scripts/dist_train.sh ${NUM_GPUS} --cfg_file cfgs/waymo_models/centerpoint_4frames.yaml" to train on the Waymo dataset.
The warning looks like this: multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 4 leaked semaphores to clean up at shutdown len(cache))
2022-09-19 10:25:52,268 INFO GT database has been saved to shared memory
2022-09-19 10:25:52,743 INFO Loading Waymo dataset
2022-09-19 10:29:48,426 INFO Total skipped info 0
2022-09-19 10:29:48,426 INFO Total samples for Waymo dataset: 158081
2022-09-19 10:29:48,435 INFO Total sampled samples for Waymo dataset: 31617
2022-09-19 10:29:48,438 INFO Loading training data to shared memory (file limit=35000)
/home2/yeyang/anaconda3/envs/spconv/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 3 leaked semaphores to clean up at shutdown len(cache))
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 0 (pid: 12255) of binary: /home2/yeyang/anaconda3/envs/spconv/bin/python
Traceback (most recent call last):
File "/home2/yeyang/anaconda3/envs/spconv/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home2/yeyang/anaconda3/envs/spconv/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home2/yeyang/anaconda3/envs/spconv/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
main()
File "/home2/yeyang/anaconda3/envs/spconv/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/home2/yeyang/anaconda3/envs/spconv/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/home2/yeyang/anaconda3/envs/spconv/lib/python3.7/site-packages/torch/distributed/run.py", line 713, in run
)(*cmd_args)
File "/home2/yeyang/anaconda3/envs/spconv/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home2/yeyang/anaconda3/envs/spconv/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=====================================================
train.py FAILED
-----------------------------------------------------
Failures:
[1]:
time : 2022-09-19_10:36:21
host : yeyang
rank : 1 (local_rank: 1)
exitcode : -7 (pid: 12256)
error_file: <N/A>
traceback : Signal 7 (SIGBUS) received by PID 12256
[2]:
You can reduce the num_of_worker to avoid this problem.
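For anyone unsure what "reduce the num_of_worker" refers to: it is the number of dataloader worker processes. With scripts/dist_train.sh, the extra arguments are forwarded to train.py, so appending something like --workers 2 to the command above is the usual way to lower it (assuming your version of train.py exposes that flag). Below is a minimal PyTorch sketch of the underlying knob, not the actual OpenPCDet dataloader code:

```python
# Minimal sketch (not the actual OpenPCDet dataloader): every DataLoader worker
# is a separate CPU process, so raising num_workers raises host-memory usage.
import torch
from torch.utils.data import DataLoader, Dataset


class DummyPointCloudDataset(Dataset):
    """Stand-in dataset; only here to illustrate the num_workers argument."""

    def __len__(self):
        return 8

    def __getitem__(self, idx):
        # Fake point features with 6 channels, as in the multi-frame setting.
        return torch.randn(1024, 6)


if __name__ == "__main__":
    loader = DataLoader(
        DummyPointCloudDataset(),
        batch_size=2,
        shuffle=True,
        num_workers=2,   # lower this if worker processes run the machine out of RAM
        pin_memory=True,
    )
    for batch in loader:
        print(batch.shape)  # torch.Size([2, 1024, 6])
```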
Thank you! There is another problem:
[Exception|implicit_gemm_pair]indices=torch.Size([720000, 4]),bs=4,ss=[41, 1504, 1504],algo=ConvAlgo.MaskImplicitGemm,ksize=[3, 3, 3],stride=[1, 1, 1],padding=[1, 1, 1],dilation=[1, 1, 1],subm=True,transpose=False
SPCONV_DEBUG_SAVE_PATH not found, you can specify SPCONV_DEBUG_SAVE_PATH as debug data save path to save debug data which can be attached in a issue.
Hello, can you explain why this error is related to num_of_worker?
This error may be caused by an incorrect input shape for spconv; you can check the input shape of the spconv layer (note: the input data's channel count for the first spconv layer should be 6 for multi-frame).
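To make the shape requirement concrete, here is a small, hypothetical sketch using the spconv 2.x API (spconv.pytorch); the sizes mirror the error message above, and the only point is that the feature tensor's channel dimension must match the in_channels of the first sparse conv layer (6 in the multi-frame config):

```python
# Hedged sketch of the input-shape check for the first spconv layer
# (spconv 2.x API; sizes are illustrative and a CUDA device is required).
import torch
import spconv.pytorch as spconv

num_voxels, in_channels = 1000, 6          # 6 input channels for multi-frame
features = torch.randn(num_voxels, in_channels).cuda()

# Voxel coordinates as (batch_idx, z, y, x), int32, inside the spatial shape.
indices = torch.stack([
    torch.zeros(num_voxels, dtype=torch.int32),                # batch index
    torch.randint(0, 41, (num_voxels,), dtype=torch.int32),    # z
    torch.randint(0, 1504, (num_voxels,), dtype=torch.int32),  # y
    torch.randint(0, 1504, (num_voxels,), dtype=torch.int32),  # x
], dim=1).cuda()

x = spconv.SparseConvTensor(features, indices, [41, 1504, 1504], batch_size=1)

first_layer = spconv.SubMConv3d(in_channels, 16, kernel_size=3,
                                padding=1, bias=False, indice_key='subm1').cuda()

# A mismatch between features.shape[1] and the layer's in_channels is the kind
# of "incorrect input shape" referred to above.
assert x.features.shape[1] == in_channels
out = first_layer(x)
print(out.features.shape)   # (num_voxels, 16)
```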
Hi, this error is caused by CPU OOM; too many workers will consume a large amount of memory in the multi-frame setting.
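As a side note on diagnosing the CPU OOM: the log above shows the GT database and training data being loaded into shared memory, and a SIGBUS (exitcode -7) from a worker is a common symptom when shared memory fills up. A hedged sketch of a quick headroom check (Linux paths, adjust as needed):

```python
# Hedged helper to report shared-memory headroom before launching training
# (Linux only; the /dev/shm path is an assumption about your setup).
import shutil


def report_usage(path: str) -> None:
    usage = shutil.disk_usage(path)
    gib = 1024 ** 3
    print(f"{path}: total={usage.total / gib:.1f} GiB, "
          f"used={usage.used / gib:.1f} GiB, free={usage.free / gib:.1f} GiB")


# /dev/shm backs PyTorch's shared-memory tensors and the "shared memory"
# loading shown in the log; if it is nearly full, workers can die with SIGBUS.
report_usage("/dev/shm")
```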
Thank you! I found that this error comes from multi-GPU training. It works when I set NUM_GPUS=1 in "bash scripts/dist_train.sh ${NUM_GPUS} --cfg_file cfgs/waymo_models/centerpoint_4frames.yaml".
This issue is stale because it has been open for 30 days with no activity.
This issue was closed because it has been inactive for 14 days since being marked as stale.