
Get stuck after one epoch of training (Multi GPU DDP) See this!

Open wusize opened this issue 2 years ago • 33 comments

When training with multiple GPUs, the program stops at "INFO - finding looplift candidates" after one epoch of training. This info message probably comes from numba, but I am not able to locate it exactly. Has anyone else run into the same problem?

See https://github.com/tianweiy/CenterPoint/issues/203#issuecomment-1133801242 for solution

wusize avatar Oct 07 '21 08:10 wusize

I also met this problem. Have you fixed it yet?

Charrrrrlie avatar Dec 01 '21 10:12 Charrrrrlie

Nope. Maybe you can try mmdetection3d.

wusize avatar Dec 03 '21 07:12 wusize

I also have no clue (my school's server only has CUDA 10.0, so I am still using torch 1.1.0 for training, and there is no issue in that version). Based on a few other recent issues, I feel the problem is related to apex, so you may want to replace the apex syncbn with torch's sync BN. Will check this more after I finish this semester (in two weeks).
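
For anyone who wants to try the swap before an official fix, here is a minimal sketch (not the repo's actual train.py; the toy model and local_rank are placeholders):

    import torch.nn as nn

    # Placeholder model with BatchNorm layers (stands in for the real detector).
    model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.ReLU())

    # Use torch's native sync BN instead of apex.parallel.convert_syncbn_model(model).
    model = nn.SyncBatchNorm.convert_sync_batchnorm(model)

    # In the real script you would then wrap the model in DDP; this assumes
    # torch.distributed.init_process_group has been called and local_rank is set:
    # model = nn.parallel.DistributedDataParallel(
    #     model.cuda(local_rank), device_ids=[local_rank], output_device=local_rank)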

tianweiy avatar Dec 03 '21 07:12 tianweiy

I think I fixed it tonight with cuda 10.1, torch=1.4.0, numba=0.53.1. It's worth mentioning that the iou3d_nms module can't be set up directly from the .sh file, since it changes the root path(?).
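
If the .sh script fails because of the path change, building the extension in place from its own directory usually works; the directory below is only an assumption about the repo layout, so adjust it to where iou3d_nms actually lives:

    cd det3d/ops/iou3d_nms
    python setup.py build_ext --inplace
    cd -  # return to the repo root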

Thanks again for the prompt replies @wusize @tianweiy !!!

Charrrrrlie avatar Dec 03 '21 13:12 Charrrrrlie

I also encountered this problem.

How can I fix it?

kagecom avatar Dec 18 '21 08:12 kagecom

I also encountered this problem.

How can I fix it?

I think the numba version may be incompatible with other dependencies. You can try the version combination I posted above, or carefully follow the author's instructions for each package.

Charrrrrlie avatar Dec 18 '21 08:12 Charrrrrlie

But there is no cuda 10.1 build of torch=1.4.0, i.e. no torch=1.4.0+cu101 package.

kagecom avatar Dec 18 '21 09:12 kagecom

It's said that "PyTorch 1.4.0 shipped with CUDA 10.1 by default, so there is no separate package with the cu101 suffix; those are only for alternative versions." I suggest using conda to install it. You can find the command on pytorch.org.
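
For reference, the command on pytorch.org's previous-versions page for this combination should be something like:

    conda install pytorch==1.4.0 torchvision==0.5.0 cudatoolkit=10.1 -c pytorch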

Charrrrrlie avatar Dec 18 '21 09:12 Charrrrrlie

The problem is related to DDP in recent torch versions.

It should be fixed now: https://github.com/tianweiy/CenterPoint/commit/e30f768a36427029b1fa055563583aafd9b58db2

You should now be able to use the most recent torch versions. Let me know if there are any further problems.

tianweiy avatar Dec 19 '21 00:12 tianweiy

I am still stuck in the training process after merging the last two commits, https://github.com/tianweiy/CenterPoint/commit/e30f768a36427029b1fa055563583aafd9b58db2 and https://github.com/tianweiy/CenterPoint/commit/a32fb02723011c84e500e16991b7ede43c8b5097.

My environment is torch 1.7.0+cu101, V100-SXM2 16G.

kagecom avatar Dec 19 '21 03:12 kagecom

Oh, interesting, do you get a timeout error? I also noticed a fairly long delay between epochs, but it does proceed after some time.

Could you try a simple example? I just pushed a new cfg to simulate the training process.

Could you run

python -m torch.distributed.launch --nproc_per_node 2 tools/train.py configs/mvp/nusc_centerpoint_voxelnet_0075voxel_fix_bn_z_scale_debug.py

It will only take a minute or so for one epoch. I want to know whether you still get stuck with this cfg.

tianweiy avatar Dec 19 '21 04:12 tianweiy

Oh, interesting, do you get a timeout error? I also noticed a fairly long delay between epochs, but it does proceed after some time.

Could you try a simple example? I just pushed a new cfg to simulate the training process.

Could you run

python -m torch.distributed.launch --nproc_per_node 2 tools/train.py configs/mvp/nusc_centerpoint_voxelnet_0075voxel_fix_bn_z_scale_debug.py

It will only take a minute or so for one epoch. I want to know whether you still get stuck with this cfg.

Hi there, I use the Waymo dataset and I don't know what is different in your debug setting.

But when I test training on Waymo with load_interval=1000, the hang disappears.

I don't know why.

kagecom avatar Dec 19 '21 14:12 kagecom

Got it. Yeah, I only changed the interval to subsample the dataset.

Hmm, weird then. Maybe just use 1.4 if that is fine for your case. I will look into this further.

tianweiy avatar Dec 19 '21 14:12 tianweiy

With load_interval = 5 it gets stuck; with load_interval = 1000 it works.

It confuses me.

kagecom avatar Dec 19 '21 15:12 kagecom

Thanks. Another thing you can try:

  1. changing

https://github.com/tianweiy/CenterPoint/blob/3fd0b8745b77575cb9810035aafc76796613f942/det3d/torchie/apis/train.py#L268

to

        model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)

  2. using spconv 2.x

I am now running a few experiments to see if there are any performance differences due to these two changes and will update soon. (Update: results with spconv 2.x + torch nn syncbn are similar to the original version.)

I am able to train the full nuScenes dataset with 8-GPU DDP (Titan V) and the latest torch (1.10.1 + CUDA 11.3).
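
For change 2, spconv 2.x is installed from pip with the wheel matching your CUDA version; for example, assuming CUDA 11.3 (pick the cuXXX suffix for your own setup):

    pip install spconv-cu113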

tianweiy avatar Dec 19 '21 15:12 tianweiy

I have tried several combinations but am still stuck.

The only thing that works for me is to set the env variable NCCL_BLOCKING_WAIT=1 when starting the training process.

However, it slows down training and I don't know why.
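
For reference, setting it is just a matter of prefixing the launch command, e.g. with the debug cfg mentioned above (any cfg works the same way):

    NCCL_BLOCKING_WAIT=1 python -m torch.distributed.launch --nproc_per_node 2 tools/train.py configs/mvp/nusc_centerpoint_voxelnet_0075voxel_fix_bn_z_scale_debug.py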

kagecom avatar Dec 26 '21 11:12 kagecom

Maybe it is related to multiprocessing; you could try adding something like this to train.py before initializing DDP:

    import multiprocessing as mp  # needed for the calls below

    # Force the 'spawn' start method if none has been set yet.
    if mp.get_start_method(allow_none=True) is None:
        mp.set_start_method('spawn')

No clue whether this works or not, though, because I just could not reproduce your error.

tianweiy avatar Jan 03 '22 02:01 tianweiy

I have the same problem. Have you solved it? Could it be related to num_workers?

zzm-hl avatar Apr 23 '22 09:04 zzm-hl

I have the same problem!!!!

Liaoqing-up avatar May 14 '22 04:05 Liaoqing-up

I have the same problem. Have you solved it? Could it be related to num_workers?

Do you have any ideas?

Liaoqing-up avatar May 14 '22 04:05 Liaoqing-up

Hello, I'm confused about the effect of load_interval. Can you explain what this parameter means?

Liaoqing-up avatar May 14 '22 04:05 Liaoqing-up

load_interval is probably not the root cause. It defines how we subsample the dataset (with load_interval=10 we use 1/10 of the dataset).
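
Roughly, the subsampling amounts to slicing the list of annotation infos; a sketch of the idea (not the repo's exact dataset code):

    # Hypothetical illustration of load_interval: keep every Nth sample.
    infos = list(range(100))          # stand-in for the loaded annotation infos
    load_interval = 10
    infos = infos[::load_interval]    # 1/10 of the dataset remains
    print(len(infos))                 # -> 10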

Unfortunately, I am not able to reproduce this issue ... Also see https://github.com/tianweiy/CenterPoint/issues/314

tianweiy avatar May 14 '22 06:05 tianweiy

Thanks. Another thing you can try:

  1. changing

https://github.com/tianweiy/CenterPoint/blob/3fd0b8745b77575cb9810035aafc76796613f942/det3d/torchie/apis/train.py#L268

to

        model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)

  2. using spconv 2.x

I am now running a few experiments to see if there are any performance differences due to these two changes and will update soon. (Update: results with spconv 2.x + torch nn syncbn are similar to the original version.)

I am able to train the full nuScenes dataset with 8-GPU DDP (Titan V) and the latest torch (1.10.1 + CUDA 11.3).

I am still stuck after trying change 1... TAT

Liaoqing-up avatar May 16 '22 01:05 Liaoqing-up

load_interval is probably not the root cause. It defines how we subsample the dataset (with load_interval=10 we use 1/10 of the dataset).

Unfortunately, I am not able to reproduce this issue ... Also see #314

I guess the problem comes from 'finding looplift candidates'; I always get stuck at this step after training for an epoch. Do you know what it means?

Liaoqing-up avatar May 16 '22 01:05 Liaoqing-up

It has nothing to do with finding looplift candidates.

It is a byproduct of starting a new epoch.

Unfortunately, I don't know what the root cause is (some people get this issue and some don't...).

tianweiy avatar May 16 '22 02:05 tianweiy

I tried changing load_interval from 1 to 100 just now, and it doesn't seem to get stuck.

Liaoqing-up avatar May 16 '22 02:05 Liaoqing-up

I tried changing load_interval from 1 to 100 just now, and it doesn't seem to get stuck.

I have tried several ways, including changing load_interval as mentioned above in this issue.

I suggest you try:

"I have tried several combinations but am still stuck.

The only thing that works for me is to set the env variable NCCL_BLOCKING_WAIT=1 when starting the training process.

However, it slows down training and I don't know why."

kagecom avatar May 16 '22 02:05 kagecom

I tried changing load_interval from 1 to 100 just now, and it doesn't seem to get stuck.

I have tried several ways, including changing load_interval as mentioned above in this issue.

I suggest you try:

"I have tried several combinations but am still stuck.

The only thing that works for me is to set the env variable NCCL_BLOCKING_WAIT=1 when starting the training process.

However, it slows down training and I don't know why."

OK, I'll try. Thank you~

Liaoqing-up avatar May 16 '22 02:05 Liaoqing-up


NCCL_BLOCKING_WAIT=1

Hello, I'm back~ I have been training with this recently; it no longer gets stuck and the speed seems normal.

Liaoqing-up avatar May 22 '22 02:05 Liaoqing-up