CenterPoint
Multi-GPU run error after 1 epoch
```
2022-04-24 09:26:36,204 - INFO - Epoch [1/20][3860/3862] lr: 0.00013, eta: 2 days, 11:49:01, time: 3.606, data_time: 0.654, transfer_time: 0.165, forward_time: 1.555, loss_parse_time: 0.002 memory: 27917,
2022-04-24 09:26:36,239 - INFO - task : ['car'], loss: 1.5316, hm_loss: 1.0401, loc_loss: 1.9661, loc_loss_elem: ['0.1882', '0.1916', '0.2169', '0.0773', '0.0694', '0.0938', '0.5669', '0.8691', '0.4311', '0.4106'], num_positive: 209.6000
2022-04-24 09:26:36,239 - INFO - task : ['truck', 'construction_vehicle'], loss: 2.2867, hm_loss: 1.6434, loc_loss: 2.5733, loc_loss_elem: ['0.2141', '0.2157', '0.3811', '0.1663', '0.1680', '0.1678', '0.3224', '0.5616', '0.5250', '0.5585'], num_positive: 126.8000
2022-04-24 09:26:36,240 - INFO - task : ['bus', 'trailer'], loss: 2.2827, hm_loss: 1.5935, loc_loss: 2.7572, loc_loss_elem: ['0.2271', '0.2145', '0.4656', '0.1020', '0.1414', '0.1205', '0.8358', '1.2225', '0.5222', '0.5521'], num_positive: 90.8000
2022-04-24 09:26:36,240 - INFO - task : ['barrier'], loss: 1.6930, hm_loss: 1.1514, loc_loss: 2.1661, loc_loss_elem: ['0.1637', '0.1844', '0.1940', '0.1647', '0.2657', '0.1325', '0.0365', '0.0479', '0.5859', '0.4582'], num_positive: 93.8000
2022-04-24 09:26:36,241 - INFO - task : ['motorcycle', 'bicycle'], loss: 1.5902, hm_loss: 1.0459, loc_loss: 2.1772, loc_loss_elem: ['0.1616', '0.1612', '0.1969', '0.1860', '0.1172', '0.1472', '0.4495', '0.6558', '0.4626', '0.5233'], num_positive: 160.8000
2022-04-24 09:26:36,241 - INFO - task : ['pedestrian', 'traffic_cone'], loss: 1.5811, hm_loss: 0.9833, loc_loss: 2.3910, loc_loss_elem: ['0.1535', '0.1580', '0.2172', '0.2156', '0.2590', '0.1522', '0.2787', '0.3207', '0.5800', '0.5356'], num_positive: 144.2000
2022-04-24 09:27:23,734 - INFO - finding looplift candidates
2022-04-24 09:27:23,734 - INFO - finding looplift candidates
2022-04-24 09:27:23,734 - INFO - finding looplift candidates
2022-04-24 09:27:23,734 - INFO - finding looplift candidates
2022-04-24 09:27:23,734 - INFO - finding looplift candidates
2022-04-24 09:27:23,734 - INFO - finding looplift candidates
2022-04-24 09:27:23,734 - INFO - finding looplift candidates
2022-04-24 09:27:23,734 - INFO - finding looplift candidates
[E ProcessGroupNCCL.cpp:719] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=583940, OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1800257 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:406] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=583940, OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1800257 milliseconds before timing out.
/public/home/u212040344/.conda/envs/centerpoint/lib/python3.7/site-packages/torch/distributed/launch.py:186: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects --local_rank
argument to be set, please
change it to read from os.environ['LOCAL_RANK']
instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
FutureWarning,
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 58033 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 58028) of binary: /public/home/u212040344/.conda/envs/centerpoint/bin/python
Traceback (most recent call last):
File "/public/home/u212040344/.conda/envs/centerpoint/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/public/home/u212040344/.conda/envs/centerpoint/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/public/home/u212040344/.conda/envs/centerpoint/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in
main()
File "/public/home/u212040344/.conda/envs/centerpoint/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/public/home/u212040344/.conda/envs/centerpoint/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/public/home/u212040344/.conda/envs/centerpoint/lib/python3.7/site-packages/torch/distributed/run.py", line 718, in run
)(*cmd_args)
File "/public/home/u212040344/.conda/envs/centerpoint/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/public/home/u212040344/.conda/envs/centerpoint/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
./tools/train.py FAILED
Failures: <NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
  time      : 2022-04-24_09:59:14
  host      : node191
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 58028)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 58028
```
Run on 2 A100 GPUs with batch_size 16 and num_workers 8*2. Environment: PyTorch 1.11, CUDA 11.3, spconv 2.x. Could you help me? Thank you.
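For anyone hitting the same watchdog message: the 1800000 ms in the log is NCCL's default 30-minute collective timeout. Raising it mostly buys time for diagnosis (the timeout usually means one rank stopped reaching the collective, e.g. because of a dataloader hang), but here is a minimal sketch of how it can be raised, assuming the training script calls `init_process_group` itself and that `LOCAL_RANK` is available in the environment (torchrun, or `--use_env`):

```python
import datetime
import os

import torch
import torch.distributed as dist

# Surface NCCL failures as Python exceptions instead of a silent hang.
os.environ.setdefault("NCCL_ASYNC_ERROR_HANDLING", "1")


def init_distributed(timeout_minutes: int = 120) -> None:
    """Initialise the NCCL process group with a longer collective timeout."""
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun / --use_env
    torch.cuda.set_device(local_rank)
    dist.init_process_group(
        backend="nccl",
        timeout=datetime.timedelta(minutes=timeout_minutes),
    )
```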
see https://github.com/tianweiy/CenterPoint/issues/224#issuecomment-986228097 and https://github.com/tianweiy/CenterPoint/issues/203
Unfortunately, I don't have any more suggestions about how to fix this, as I can't reproduce the error in my setup.
maybe also check out https://github.com/open-mmlab/OpenPCDet/issues/696
Please let me know if you find that any of these work, and I will update the code accordingly.
The same problem is reported in https://www.zhihu.com/question/512132168 and https://discuss.pytorch.org/t/gpu-startup-is-way-too-slow/147956/12
I think IOMMU needs to be disabled. Unfortunately, the GPUs are on the school's cluster and I don't have permission to disable IOMMU in the BIOS, so the problem still exists. I did try training on a single GPU, which works normally but is relatively slow.
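If the BIOS cannot be touched, a software-side workaround mentioned in the threads linked above is to disable NCCL's peer-to-peer GPU transfers, which are the transfers typically broken by IOMMU/ACS settings. Whether it helps on this particular cluster is an assumption; a minimal sketch (the variables must be set before the process group is created, e.g. at the very top of the training script or in the launch environment):

```python
import os

# Force NCCL to route GPU-to-GPU traffic through host memory instead of
# direct P2P paths; slower, but sidesteps IOMMU/ACS-related hangs.
os.environ["NCCL_P2P_DISABLE"] = "1"

# Optional: log NCCL's transport choices so the fallback can be verified.
os.environ.setdefault("NCCL_DEBUG", "INFO")
```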
Do you still use apex or the native syncbn?
Check https://github.com/tianweiy/CenterPoint/issues/224#issuecomment-986228097 (they seem to have fixed the problem by switching).
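For reference, a minimal sketch of switching a model to PyTorch's native SyncBatchNorm instead of apex (the wrapper function name below is made up for illustration):

```python
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def wrap_model_for_ddp(model: nn.Module, local_rank: int) -> nn.Module:
    # Replace every BatchNorm*d layer with nn.SyncBatchNorm so batch
    # statistics are synchronised across ranks without apex.
    model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
    model = model.cuda(local_rank)
    return DDP(model, device_ids=[local_rank], output_device=local_rank)
```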
I have used the native syncbn instead, and I have decreased the number of workers and the batch size, but it doesn't work either.
I see. Hmm... maybe you can try the CUDA 10.0 + torch 1.1 + spconv 1.x version? As I remember, a year ago no one had issues with multi-GPU training. Most problems appeared a few months ago with the newer torch and CUDA versions.
Worst case scenario, you can just use OpenPCDet's CenterPoint implementation (it also looks good and has comparable performance).
Thank you, I will give it a try!
see https://github.com/tianweiy/CenterPoint/issues/203#issuecomment-1133801242
I checked: the Tesla A100 does not seem to support CUDA 10.x, which makes this more troublesome. I will try Open3D next. Thank you very much for your reply!
Did you solve the problem? I have faced the same problem as you before. Can you share any ideas about this NCCL timeout error? BTW, my environment is CUDA 11.3, spconv 2.1.21, torch 1.12.1, numba 0.57.0, and I got this error before even one iteration starts.
Suppose there are two nodes, each with 8 GPUs. Can the batch size be set to different values on different nodes?
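This last question never got an answer in the thread. Technically each rank builds its own DataLoader, so per-node batch sizes can differ, but DDP still averages gradients uniformly across ranks, so unequal batch sizes change the effective per-sample weighting. A minimal sketch, assuming torchrun (which exports `GROUP_RANK` as the node index) and purely illustrative batch sizes:

```python
import os

from torch.utils.data import DataLoader, DistributedSampler


def build_loader(dataset, per_node_batch_sizes=(16, 8), num_workers=8):
    """Choose the per-GPU batch size based on which node this rank is on."""
    node_idx = int(os.environ.get("GROUP_RANK", 0))  # node index under torchrun
    sampler = DistributedSampler(dataset, shuffle=True)
    return DataLoader(
        dataset,
        batch_size=per_node_batch_sizes[node_idx],
        sampler=sampler,
        num_workers=num_workers,
        pin_memory=True,
    )
```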