RoadNet
About training time
Hello, I am running the training code of lanegraph2seq on nuScenes. Each batch takes about 3.6 seconds, and hence the total training process will take about 20 days. Is this speed normal?
BTW, would it be possible for you to release the pre-trained checkpoint ckpts/lssego_segmentation_48x32_b4x8_resnet_adam_24e_ponsplit_19.pth?
Apologies for the delay due to the ECCV submission deadline. In compliance with our company's confidentiality policies, the original code cannot be published. This version of the code has been independently reproduced. While the original code is known for its speed, I'm currently unable to determine the cause of performance issues in this version. I am actively investigating the matter.
The checkpoint is the pretraining checkpoint. Please refer to issue https://github.com/fudan-zvg/RoadNet/issues/2#issuecomment-2004486882
Hi! @VictorLlu Thank you for your update, but the following file still seems to be missing. Could you please add it?
from .data import nuscenes_converter_pon_centerline
#5
@VictorLlu Comparing training with 1 GPU and 8 GPUs, I found that the batch time is almost equal to NUM_GPUs * batch_time_per_GPU + $\Delta$. Is this behavior abnormal?
(Screenshots of the per-batch timings for NUM_GPUs = 1, 2, 4, and 8 were attached here.)
@VictorLlu
FYI, I printed the time spent on data preprocessing (time1), forward propagation (time2), and back propagation (time3) in mmengine/model/wrappers/distributed.py when training with 8 GPUs. It seems that the major cause of the long average batch time is extremely slow back propagation in some iterations.
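For reference, the measurement itself was nothing special; below is a minimal, self-contained sketch of the idea in plain PyTorch (not the actual mmengine wrapper code; model, batch, optimizer, and preprocess are placeholders, and the model is assumed to return a dict of losses as mmdet-style models do):

```python
import time
import torch

def timed_train_step(model, batch, optimizer, preprocess):
    # Synchronize before each timestamp so asynchronous CUDA kernels
    # don't get attributed to the wrong stage.
    torch.cuda.synchronize()
    t0 = time.perf_counter()

    inputs, targets = preprocess(batch)      # data preprocessing
    torch.cuda.synchronize()
    t1 = time.perf_counter()

    losses = model(inputs, targets)          # forward propagation
    loss = sum(losses.values())              # assumes a dict of loss tensors
    torch.cuda.synchronize()
    t2 = time.perf_counter()

    optimizer.zero_grad()
    loss.backward()                          # back propagation
    optimizer.step()
    torch.cuda.synchronize()
    t3 = time.perf_counter()

    print(f"time1={t1 - t0:.3f}s  time2={t2 - t1:.3f}s  time3={t3 - t2:.3f}s")
    return loss
```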
BTW, could you please update the nuscenes_converter_pon_centerline file soon? It would be of great help! Thank you.
I've made a minor modification to the image loading process:
import mmcv
import numpy as np
from mmengine.fileio import get

# Read the raw bytes for all camera images first, then decode them,
# instead of calling mmcv.imread once per file.
img_bytes = [
    get(name, backend_args=self.backend_args) for name in filename
]
img = [
    mmcv.imfrombytes(img_byte, flag=self.color_type)
    for img_byte in img_bytes
]
img = np.stack(img, axis=-1)
This replaces the use of mmcv.imread. It provided some improvement, yet the loading time remains significantly long. I find it to be highly related to the num_workers setting.
I've noticed that the delay between iterations directly corresponds to the num_workers setting in multi-GPU training. Even after eliminating every time-consuming element in the dataloader, it still stalls at intervals consistent with the num_workers count. This suggests that the issue might stem from mmdetection3d rather than the dataloader itself.
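For anyone debugging the same periodic stall, here is a minimal sketch of the DataLoader settings that usually help with a once-every-num_workers hiccup (plain PyTorch, not this repo's config; the dataset below is only a stand-in):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(64, 3, 224, 224))  # stand-in for the real dataset

loader = DataLoader(
    dataset,
    batch_size=2,
    num_workers=4,
    persistent_workers=True,  # keep workers alive between epochs instead of re-spawning them
    pin_memory=True,          # faster host-to-GPU copies
    prefetch_factor=4,        # each worker prepares 4 batches ahead of time
)
```

In an mmengine-based config these keys normally live in the train_dataloader dict.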
The issue has also been identified in other models in mmdetection3d, suggesting that it might be inherent to this version. I will push an mmdetection 0.17.1 version in the next few days.
When I use a single 2080 GPU, it takes 59 days to complete the training...
Hi! It has been two weeks now; when will you be able to update the mmdetection 0.17.0 version? It would be of significant help.
Hi, I have found a solution in the MMDetection issue https://github.com/open-mmlab/mmdetection/issues/11503: update your PyTorch version to >= 2.2. I have tested it, and it successfully reduced the training time from 25 days to 4 days.
Hopefully this works for you.
BTW, I installed the latest PyTorch 2.3.1 with
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
and left the rest of the environment unchanged.
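A quick way to confirm the upgrade took effect and CUDA is still visible (just a sanity check, not part of the fix itself):

```python
import torch

print(torch.__version__)          # expect >= 2.2, e.g. 2.3.1
print(torch.cuda.is_available())  # expect True
```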
Hi, I have updated PyTorch to the latest version and successfully reduced the training time. However, the gradients become NaN after a certain number of iterations and the losses become 0. Did you have this problem during training?
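For reference, a small generic sketch for localizing where the NaNs first appear (plain PyTorch, not project code; the model argument is a placeholder):

```python
import torch

# Makes the backward pass raise at the op that produced the first NaN/Inf.
# It is slow, so enable it only while debugging.
torch.autograd.set_detect_anomaly(True)

def report_bad_grads(model):
    """Call right after loss.backward() to see which parameters blow up."""
    for name, p in model.named_parameters():
        if p.grad is not None and not torch.isfinite(p.grad).all():
            print(f"non-finite gradient in {name}")
```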
I met the same problem. What I tried was changing batch_size from 2 to 4 and lr_rate from 2e-4 to 1e-4; after that the problem was gone and the model trains normally.
Original hyperparameters (batch_size=2, lr_rate=2e-4): log
Modified hyperparameters (batch_size=4, lr_rate=1e-4): log
However, I haven't dug into it in detail since the model hasn't finished training yet, so I can only offer a rough guess: the problem is probably caused by some abnormal data input/GT, and enlarging the batch_size may mitigate its impact.
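Roughly what that change looks like in an mmengine-style config; the key names below are illustrative and this repo's config files may organize them differently, so treat this as a sketch rather than the exact diff:

```python
train_dataloader = dict(
    batch_size=4,  # was 2
    num_workers=4,
)

optim_wrapper = dict(
    type='OptimWrapper',
    optimizer=dict(type='AdamW', lr=1e-4),  # was 2e-4; keep whatever optimizer the config already uses
    # Gradient clipping is another common guard against the NaN gradients mentioned above.
    clip_grad=dict(max_norm=35, norm_type=2),
)
```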
Hi~ Have you finished the training and successfully reproduced the results from the paper?
Hello, did you manage to reproduce the results in the paper?
FYI, I do have some results, though they are not great; they are shown below. Since the model didn't converge well (probably because of the hyperparameter settings and limited GPU resources) and I didn't spend much time on optimizing it or implementing a well-designed visualization script, waiting for the officially released model weights and visualization script would be the ultimate solution.
Hello, may I ask which script you used to train and get this result? I tried all the scripts provided by the author; only the AAAI version runs successfully, and the others all have bugs.