
multi gpus run error after 1 epoch

Open zzm-hl opened this issue 2 years ago • 12 comments

```
2022-04-24 09:26:36,204 - INFO - Epoch [1/20][3860/3862] lr: 0.00013, eta: 2 days, 11:49:01, time: 3.606, data_time: 0.654, transfer_time: 0.165, forward_time: 1.555, loss_parse_time: 0.002 memory: 27917,
2022-04-24 09:26:36,239 - INFO - task : ['car'], loss: 1.5316, hm_loss: 1.0401, loc_loss: 1.9661, loc_loss_elem: ['0.1882', '0.1916', '0.2169', '0.0773', '0.0694', '0.0938', '0.5669', '0.8691', '0.4311', '0.4106'], num_positive: 209.6000
2022-04-24 09:26:36,239 - INFO - task : ['truck', 'construction_vehicle'], loss: 2.2867, hm_loss: 1.6434, loc_loss: 2.5733, loc_loss_elem: ['0.2141', '0.2157', '0.3811', '0.1663', '0.1680', '0.1678', '0.3224', '0.5616', '0.5250', '0.5585'], num_positive: 126.8000
2022-04-24 09:26:36,240 - INFO - task : ['bus', 'trailer'], loss: 2.2827, hm_loss: 1.5935, loc_loss: 2.7572, loc_loss_elem: ['0.2271', '0.2145', '0.4656', '0.1020', '0.1414', '0.1205', '0.8358', '1.2225', '0.5222', '0.5521'], num_positive: 90.8000
2022-04-24 09:26:36,240 - INFO - task : ['barrier'], loss: 1.6930, hm_loss: 1.1514, loc_loss: 2.1661, loc_loss_elem: ['0.1637', '0.1844', '0.1940', '0.1647', '0.2657', '0.1325', '0.0365', '0.0479', '0.5859', '0.4582'], num_positive: 93.8000
2022-04-24 09:26:36,241 - INFO - task : ['motorcycle', 'bicycle'], loss: 1.5902, hm_loss: 1.0459, loc_loss: 2.1772, loc_loss_elem: ['0.1616', '0.1612', '0.1969', '0.1860', '0.1172', '0.1472', '0.4495', '0.6558', '0.4626', '0.5233'], num_positive: 160.8000
2022-04-24 09:26:36,241 - INFO - task : ['pedestrian', 'traffic_cone'], loss: 1.5811, hm_loss: 0.9833, loc_loss: 2.3910, loc_loss_elem: ['0.1535', '0.1580', '0.2172', '0.2156', '0.2590', '0.1522', '0.2787', '0.3207', '0.5800', '0.5356'], num_positive: 144.2000

2022-04-24 09:27:23,734 - INFO - finding looplift candidates
2022-04-24 09:27:23,734 - INFO - finding looplift candidates
2022-04-24 09:27:23,734 - INFO - finding looplift candidates
2022-04-24 09:27:23,734 - INFO - finding looplift candidates
2022-04-24 09:27:23,734 - INFO - finding looplift candidates
2022-04-24 09:27:23,734 - INFO - finding looplift candidates
2022-04-24 09:27:23,734 - INFO - finding looplift candidates
2022-04-24 09:27:23,734 - INFO - finding looplift candidates
[E ProcessGroupNCCL.cpp:719] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=583940, OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1800257 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:406] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=583940, OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1800257 milliseconds before timing out.
/public/home/u212040344/.conda/envs/centerpoint/lib/python3.7/site-packages/torch/distributed/launch.py:186: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torchrun. Note that --use_env is set by default in torchrun. If your script expects --local_rank argument to be set, please change it to read from os.environ['LOCAL_RANK'] instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions
  FutureWarning,
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 58033 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 58028) of binary: /public/home/u212040344/.conda/envs/centerpoint/bin/python
Traceback (most recent call last):
  File "/public/home/u212040344/.conda/envs/centerpoint/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/public/home/u212040344/.conda/envs/centerpoint/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/public/home/u212040344/.conda/envs/centerpoint/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/public/home/u212040344/.conda/envs/centerpoint/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/public/home/u212040344/.conda/envs/centerpoint/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/public/home/u212040344/.conda/envs/centerpoint/lib/python3.7/site-packages/torch/distributed/run.py", line 718, in run
    )(*cmd_args)
  File "/public/home/u212040344/.conda/envs/centerpoint/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/public/home/u212040344/.conda/envs/centerpoint/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

./tools/train.py FAILED

Failures:
  <NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
  time      : 2022-04-24_09:59:14
  host      : node191
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 58028)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 58028
```

Run on 2 A100 GPUs with batch_size 16 and num_workers 8*2. Environment: pytorch 1.11, cuda 11.3, spconv 2.x. Could you help me? Thank you.
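For context, the 1800000 ms in the watchdog message is simply the default 30-minute `timeout` of the process group, so one commonly suggested (but unconfirmed) workaround is to raise it at initialization while debugging the stall. Below is a minimal sketch, assuming the script is started through the torch distributed launcher so the rendezvous environment variables are already set; the `LOCAL_RANK` handling is an illustration, not CenterPoint's actual code.

```python
# Hedged sketch only: raise the NCCL collective timeout so a long pause at the
# epoch boundary (e.g. evaluation or data loading on one rank) does not abort
# the whole job. This buys time; it does not fix the underlying stall.
import datetime
import os

import torch
import torch.distributed as dist


def init_distributed() -> int:
    # Assumes torch.distributed.launch / torchrun already set MASTER_ADDR,
    # MASTER_PORT, RANK, WORLD_SIZE and LOCAL_RANK in the environment.
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)
    dist.init_process_group(
        backend="nccl",
        timeout=datetime.timedelta(hours=2),  # default is 30 min (1800000 ms)
    )
    return local_rank
```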

zzm-hl avatar Apr 24 '22 03:04 zzm-hl

see https://github.com/tianweiy/CenterPoint/issues/224#issuecomment-986228097 and https://github.com/tianweiy/CenterPoint/issues/203

Unfortunately, I don't have any more suggestions for how to fix this, as I can't reproduce the error in my setup.

tianweiy avatar Apr 24 '22 03:04 tianweiy

maybe also check out https://github.com/open-mmlab/OpenPCDet/issues/696

Please let me know if you find any of these work and I will update the code accordingly

tianweiy avatar Apr 24 '22 03:04 tianweiy

The same problem is reported in https://www.zhihu.com/question/512132168 and https://discuss.pytorch.org/t/gpu-startup-is-way-too-slow/147956/12

I think the IOMMU needs to be disabled. Unfortunately, the GPUs are on the school's cluster and I do not have permission to disable IOMMU in the BIOS, so the problem still exists. I did try it, though: training on a single GPU works normally, but it is relatively slow.
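For anyone in the same situation where the BIOS cannot be changed, a workaround that sometimes helps on IOMMU-affected machines is to disable NCCL's peer-to-peer transfers through environment variables. Whether this actually resolves the hang on this particular cluster is untested, so treat the sketch below as an experiment, not a fix; `NCCL_P2P_DISABLE` and `NCCL_DEBUG` are standard NCCL variables.

```python
# Hedged workaround sketch: force NCCL to avoid GPU peer-to-peer transfers,
# the path that an enabled IOMMU most often breaks. These must be set before
# torch.distributed (and therefore NCCL) is initialized, e.g. at the very top
# of tools/train.py or exported in the job script.
import os

os.environ["NCCL_P2P_DISABLE"] = "1"  # route traffic through host memory instead of P2P
os.environ["NCCL_DEBUG"] = "INFO"     # optional: log which transports NCCL actually picks
```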

zzm-hl avatar Apr 25 '22 17:04 zzm-hl

Do you still use apex or the native syncbn?

Check https://github.com/tianweiy/CenterPoint/issues/224#issuecomment-986228097 (they seem to have fixed the problem by switching).
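For reference, a minimal sketch of what switching to the native syncbn roughly looks like, assuming the model is an ordinary `nn.Module` built elsewhere in the training script and `local_rank` comes from the launcher (an illustration, not the exact code in this repo):

```python
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP


def wrap_with_native_syncbn(model: nn.Module, local_rank: int) -> nn.Module:
    # Replace every BatchNorm*d layer in the model with torch.nn.SyncBatchNorm.
    model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
    model = model.cuda(local_rank)
    # Plain DDP wrapper instead of apex.parallel.DistributedDataParallel.
    return DDP(model, device_ids=[local_rank], output_device=local_rank)
```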

tianweiy avatar Apr 25 '22 18:04 tianweiy

I have switched to the native syncbn, and I have also decreased the number of workers and the batch size, but it doesn't work either.


zzm-hl avatar Apr 26 '22 00:04 zzm-hl

I see. Hmm... maybe you can try the cuda 10.0 + torch 1.1 + spconv 1.x version? I remember that a year ago no one had issues with multi-GPU training; most of the problems appeared in the last few months with the newer torch and cuda versions.

tianweiy avatar Apr 26 '22 03:04 tianweiy

Worst case scenario, you can just use OpenPCDet's CenterPoint implementation (it also looks good and has comparable performance).

tianweiy avatar Apr 26 '22 03:04 tianweiy

> Worst case scenario, you can just use OpenPCDet's CenterPoint implementation (it also looks good and has comparable performance).

Thank you, I will give it a try!

zzm-hl avatar Apr 30 '22 14:04 zzm-hl

see https://github.com/tianweiy/CenterPoint/issues/203#issuecomment-1133801242

tianweiy avatar May 22 '22 02:05 tianweiy

I checked and the Tesla A100 does not seem to support CUDA 10.x, which makes this more troublesome. I will try Open3D next. Thank you very much for your reply!


zzm-hl avatar Oct 11 '22 09:10 zzm-hl

Did you solve the problem? I have faced the same problem before. Can you share any ideas about this NCCL timeout error? BTW, my environment is cuda 11.3, spconv 2.1.21, torch 1.12.1, numba 0.57.0, and I hit this error before even one iteration starts.

ZecCheng avatar Jul 04 '23 14:07 ZecCheng

Suppose there are two nodes with 8 GPUs each. Can the batch size be set to a different value on each node?

JerryDaHeLian avatar Nov 24 '23 07:11 JerryDaHeLian