
Get stuck after one epoch of training (Multi GPU DDP) See this!

Open wusize opened this issue 2 years ago • 33 comments

When training with multiple GPUs, the program stops at "INFO - finding looplift candidates" after one epoch of training. This info message probably comes from numba, but I am not able to locate it exactly. Has anyone else run into the same problem?

See https://github.com/tianweiy/CenterPoint/issues/203#issuecomment-1133801242 for solution

wusize avatar Oct 07 '21 08:10 wusize

I also met this problem. Have you fixed it yet?

Charrrrrlie avatar Dec 01 '21 10:12 Charrrrrlie

Nope. Maybe you can try mmdetection3d.

wusize avatar Dec 03 '21 07:12 wusize

I also have no clue (my school's server only has CUDA 10.0, so I am still using torch 1.1.0 for training, and there is no issue in that version). Based on a few other recent issues, I feel the problem is related to apex, so you may want to replace the apex syncbn with torch's sync BN. Will check this more after I finish this semester (in two weeks).
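
For anyone who wants to try the swap before an official fix, here is a minimal sketch (not the repo's actual train.py; the toy model and local_rank are placeholders):

    import torch.nn as nn

    # Placeholder model with BatchNorm layers (stands in for the real detector).
    model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.ReLU())

    # Use torch's native sync BN instead of apex.parallel.convert_syncbn_model(model).
    model = nn.SyncBatchNorm.convert_sync_batchnorm(model)

    # In the real script you would then wrap the model in DDP; this assumes
    # torch.distributed.init_process_group has been called and local_rank is set:
    # model = nn.parallel.DistributedDataParallel(
    #     model.cuda(local_rank), device_ids=[local_rank], output_device=local_rank)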

tianweiy avatar Dec 03 '21 07:12 tianweiy

I think I fixed it tonight with cuda 10.1, torch=1.4.0, numba=0.53.1. It's worth mentioning that the iou3d_nms module can't be set up directly from the .sh file, since it changes the root path(?).
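
If the .sh script fails because of the path change, building the extension in place from its own directory usually works; the directory below is only an assumption about the repo layout, so adjust it to where iou3d_nms actually lives:

    cd det3d/ops/iou3d_nms
    python setup.py build_ext --inplace
    cd -  # return to the repo root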

Thanks again for the prompt replies @wusize @tianweiy !!!

Charrrrrlie avatar Dec 03 '21 13:12 Charrrrrlie

I also encountered this problem.

How can I fix it?

kagecom avatar Dec 18 '21 08:12 kagecom

I also encountered this problem.

How can I fix it?

I think the numba version may be incompatible with other dependencies. You can try the version combination I posted above, or carefully follow the author's instructions for each package.

Charrrrrlie avatar Dec 18 '21 08:12 Charrrrrlie

But there is no cuda 10.1 build of torch=1.4.0, i.e. no torch=1.4.0+cu101 package.

kagecom avatar Dec 18 '21 09:12 kagecom

It's said that "PyTorch 1.4.0 shipped with CUDA 10.1 by default, so there is no separate package with the cu101 suffix; those are only for alternative versions." I suggest using conda to install it. You can find the command on pytorch.org.
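
For reference, the command on pytorch.org's previous-versions page for this combination should be something like:

    conda install pytorch==1.4.0 torchvision==0.5.0 cudatoolkit=10.1 -c pytorch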

Charrrrrlie avatar Dec 18 '21 09:12 Charrrrrlie

The problem is related to DDP in recent torch versions.

It should be fixed now: https://github.com/tianweiy/CenterPoint/commit/e30f768a36427029b1fa055563583aafd9b58db2

You should now be able to use the most recent torch versions. Let me know if there are any further problems.

tianweiy avatar Dec 19 '21 00:12 tianweiy

I am still stuck in the training process after merging the last two commits, https://github.com/tianweiy/CenterPoint/commit/e30f768a36427029b1fa055563583aafd9b58db2 and https://github.com/tianweiy/CenterPoint/commit/a32fb02723011c84e500e16991b7ede43c8b5097.

My environment is torch 1.7.0+cu101, V100-SXM2 16G.

kagecom avatar Dec 19 '21 03:12 kagecom

Oh, interesting, do you get a timeout error? I also noticed a fairly long delay between epochs, but it does proceed after some time.

Could you try a simple example? I just pushed a new cfg to simulate the training process.

Could you run

python -m torch.distributed.launch --nproc_per_node 2 tools/train.py configs/mvp/nusc_centerpoint_voxelnet_0075voxel_fix_bn_z_scale_debug.py

It will only take a minute or so for one epoch. I want to know whether you still get stuck with this cfg.

tianweiy avatar Dec 19 '21 04:12 tianweiy

Oh, interesting, do you get a timeout error? I also noticed a fairly long delay between epochs, but it does proceed after some time.

Could you try a simple example? I just pushed a new cfg to simulate the training process.

Could you run

python -m torch.distributed.launch --nproc_per_node 2 tools/train.py configs/mvp/nusc_centerpoint_voxelnet_0075voxel_fix_bn_z_scale_debug.py

It will only take a minute or so for one epoch. I want to know whether you still get stuck with this cfg.

Hi there, I use the Waymo dataset and I don't know what is different in your debug setting.

But when I test training on Waymo with load_interval=1000, the hang disappears.

I don't know why.

kagecom avatar Dec 19 '21 14:12 kagecom

Got it. Yeah, I only changed the interval to subsample the dataset.

Hmm, weird then. Maybe just use 1.4 if that is fine for your case. I will look into this further.

tianweiy avatar Dec 19 '21 14:12 tianweiy

With load_interval = 5 it gets stuck; with load_interval = 1000 it works.

It confuses me.

kagecom avatar Dec 19 '21 15:12 kagecom

Thanks. Another thing you can try:

  1. changing

https://github.com/tianweiy/CenterPoint/blob/3fd0b8745b77575cb9810035aafc76796613f942/det3d/torchie/apis/train.py#L268

to

        model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)

  2. using spconv 2.x

I am now running a few experiments to see if there are any performance differences due to these two changes and will update soon. (Update: results with spconv 2.x + torch nn syncbn are similar to the original version.)

I am able to train the full nuScenes dataset with 8-GPU DDP (Titan V) and the latest torch (1.10.1 + CUDA 11.3).
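
For change 2, spconv 2.x is installed from pip with the wheel matching your CUDA version; for example, assuming CUDA 11.3 (pick the cuXXX suffix for your own setup):

    pip install spconv-cu113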

tianweiy avatar Dec 19 '21 15:12 tianweiy

I have tried several combinations but am still stuck.

The only thing that works for me is to set the env variable NCCL_BLOCKING_WAIT=1 when starting the training process.

However, it slows down training and I don't know why.
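
For reference, setting it is just a matter of prefixing the launch command, e.g. with the debug cfg mentioned above (any cfg works the same way):

    NCCL_BLOCKING_WAIT=1 python -m torch.distributed.launch --nproc_per_node 2 tools/train.py configs/mvp/nusc_centerpoint_voxelnet_0075voxel_fix_bn_z_scale_debug.py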

kagecom avatar Dec 26 '21 11:12 kagecom

Maybe it is related to multiprocessing; you could try adding something like this to train.py before initializing DDP:

    import multiprocessing as mp  # needed for the calls below

    # Force the 'spawn' start method if none has been set yet.
    if mp.get_start_method(allow_none=True) is None:
        mp.set_start_method('spawn')

No clue whether this works or not, though, because I just could not reproduce your error.

tianweiy avatar Jan 03 '22 02:01 tianweiy

I have the same problem. Have you solved it? Could it be related to num_workers?

zzm-hl avatar Apr 23 '22 09:04 zzm-hl

I have the same problem!!!!

Liaoqing-up avatar May 14 '22 04:05 Liaoqing-up

I have the same problem. Have you solved it? Could it be related to num_workers?

Do you have any ideas?

Liaoqing-up avatar May 14 '22 04:05 Liaoqing-up

Hello, I'm confused about the effect of load_interval. Can you explain what this parameter means?

Liaoqing-up avatar May 14 '22 04:05 Liaoqing-up

load_interval is probably not the root cause. It defines how we subsample the dataset (with load_interval=10 we use 1/10 of the dataset).
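
Roughly, the subsampling amounts to slicing the list of annotation infos; a sketch of the idea (not the repo's exact dataset code):

    # Hypothetical illustration of load_interval: keep every Nth sample.
    infos = list(range(100))          # stand-in for the loaded annotation infos
    load_interval = 10
    infos = infos[::load_interval]    # 1/10 of the dataset remains
    print(len(infos))                 # -> 10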

Unfortunately, I am not able to reproduce this issue ... Also see https://github.com/tianweiy/CenterPoint/issues/314

tianweiy avatar May 14 '22 06:05 tianweiy

Thanks. Another thing you can try:

  1. changing

https://github.com/tianweiy/CenterPoint/blob/3fd0b8745b77575cb9810035aafc76796613f942/det3d/torchie/apis/train.py#L268

to

        model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)

  2. using spconv 2.x

I am now running a few experiments to see if there are any performance differences due to these two changes and will update soon. (Update: results with spconv 2.x + torch nn syncbn are similar to the original version.)

I am able to train the full nuScenes dataset with 8-GPU DDP (Titan V) and the latest torch (1.10.1 + CUDA 11.3).

I am still stuck after trying change 1... TAT

Liaoqing-up avatar May 16 '22 01:05 Liaoqing-up

load_interval is probably not the root cause. It defines how we subsample the dataset (with load_interval=10 we use 1/10 of the dataset).

Unfortunately, I am not able to reproduce this issue ... Also see #314

I guess the problem comes from 'finding looplift candidates'; I always get stuck at this step after training for an epoch. Do you know what it means?

Liaoqing-up avatar May 16 '22 01:05 Liaoqing-up

It has nothing to do with finding looplift candidates.

It is a byproduct of starting a new epoch.

Unfortunately, I don't know what the root cause is (some people get this issue and some don't...).

tianweiy avatar May 16 '22 02:05 tianweiy

I tried changing load_interval from 1 to 100 just now, and it doesn't seem to get stuck.

Liaoqing-up avatar May 16 '22 02:05 Liaoqing-up

I tried changing load_interval from 1 to 100 just now, and it doesn't seem to get stuck.

I have tried several ways, including changing load_interval as mentioned above in this issue.

I suggest you try:

"I have tried several combinations but am still stuck.

The only thing that works for me is to set the env variable NCCL_BLOCKING_WAIT=1 when starting the training process.

However, it slows down training and I don't know why."

kagecom avatar May 16 '22 02:05 kagecom

I tried changing load_interval from 1 to 100 just now, and it doesn't seem to get stuck.

I have tried several ways, including changing load_interval as mentioned above in this issue.

I suggest you try:

"I have tried several combinations but am still stuck.

The only thing that works for me is to set the env variable NCCL_BLOCKING_WAIT=1 when starting the training process.

However, it slows down training and I don't know why."

OK, I'll try. Thank you~

Liaoqing-up avatar May 16 '22 02:05 Liaoqing-up


NCCL_BLOCKING_WAIT=1

Hello, I'm back~ I have been training with this recently; it no longer gets stuck and the speed seems normal.

Liaoqing-up avatar May 22 '22 02:05 Liaoqing-up