RoadNet
About training time
Hello, I am running the training code of lanegraph2seq on nuScenes. Each batch takes about 3.6 seconds, and hence the total training process will take about 20 days. Is this speed normal?
BTW, would it be possible for you to release the pre-trained checkpoint ckpts/lssego_segmentation_48x32_b4x8_resnet_adam_24e_ponsplit_19.pth?
Apologies for the delay due to the ECCV submission deadline. In compliance with our company's confidentiality policies, the original code cannot be published. This version of the code has been independently reproduced. While the original code is known for its speed, I'm currently unable to determine the cause of performance issues in this version. I am actively investigating the matter.
The checkpoint is the pretraining checkpoint. Please refer to issue https://github.com/fudan-zvg/RoadNet/issues/2#issuecomment-2004486882
Hi! @VictorLlu Thank you for your update, but the following file still seems to be missing. Could you please add it?
from .data import nuscenes_converter_pon_centerline
#5
@VictorLlu Comparing training with 1 GPU and 8 GPUs, I found that the batch time is almost equal to NUM_GPUs * batch_time_per_GPU + $\Delta$. Is this behavior abnormal?
(Screenshots of the per-batch timings for NUM_GPUs = 1, 2, 4, and 8 were attached here.)
@VictorLlu
FYI, I printed the time spent on data preprocessing (time1), forward propagation (time2), and back propagation (time3) in mmengine/model/wrappers/distributed.py when training with 8 GPUs. It seems that the major cause of the long average batch time is extremely slow back propagation in some iterations.
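For reference, the measurement itself was nothing special; below is a minimal, self-contained sketch of the idea in plain PyTorch (not the actual mmengine wrapper code; model, batch, optimizer, and preprocess are placeholders, and the model is assumed to return a dict of losses as mmdet-style models do):

```python
import time
import torch

def timed_train_step(model, batch, optimizer, preprocess):
    # Synchronize before each timestamp so asynchronous CUDA kernels
    # don't get attributed to the wrong stage.
    torch.cuda.synchronize()
    t0 = time.perf_counter()

    inputs, targets = preprocess(batch)      # data preprocessing
    torch.cuda.synchronize()
    t1 = time.perf_counter()

    losses = model(inputs, targets)          # forward propagation
    loss = sum(losses.values())              # assumes a dict of loss tensors
    torch.cuda.synchronize()
    t2 = time.perf_counter()

    optimizer.zero_grad()
    loss.backward()                          # back propagation
    optimizer.step()
    torch.cuda.synchronize()
    t3 = time.perf_counter()

    print(f"time1={t1 - t0:.3f}s  time2={t2 - t1:.3f}s  time3={t3 - t2:.3f}s")
    return loss
```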
BTW, could you please update the nuscenes_converter_pon_centerline file soon? It would be of great help! Thank you.
I've made a minor modification to the image loading process:
import mmcv
import numpy as np
from mmengine.fileio import get

# Read the raw bytes for all camera images first, then decode them,
# instead of calling mmcv.imread once per file.
img_bytes = [
    get(name, backend_args=self.backend_args) for name in filename
]
img = [
    mmcv.imfrombytes(img_byte, flag=self.color_type)
    for img_byte in img_bytes
]
img = np.stack(img, axis=-1)
This replaces the use of mmcv.imread. It provided some improvement, yet the loading time remains significantly long. I find it to be highly related to the num_workers setting.
I've noticed that the delay between iterations directly corresponds to the num_workers setting in multi-GPU training. Even after eliminating every time-consuming element in the dataloader, it still stalls at intervals consistent with the num_workers count. This suggests that the issue might stem from mmdetection3d rather than the dataloader itself.
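For anyone debugging the same periodic stall, here is a minimal sketch of the DataLoader settings that usually help with a once-every-num_workers hiccup (plain PyTorch, not this repo's config; the dataset below is only a stand-in):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(64, 3, 224, 224))  # stand-in for the real dataset

loader = DataLoader(
    dataset,
    batch_size=2,
    num_workers=4,
    persistent_workers=True,  # keep workers alive between epochs instead of re-spawning them
    pin_memory=True,          # faster host-to-GPU copies
    prefetch_factor=4,        # each worker prepares 4 batches ahead of time
)
```

In an mmengine-based config these keys normally live in the train_dataloader dict.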
The issue has also been identified in other models in mmdetection3d, suggesting that it might be inherent to this version. I will push an mmdetection 0.17.1 version in the next few days.
When I use a single 2080 GPU, it takes 59 days to complete the training...
Hi! It has been two weeks now; when will you be able to update the mmdetection 0.17.0 version? It would be of significant help.
Hi, I have found a solution in the MMDetection issue https://github.com/open-mmlab/mmdetection/issues/11503: update your PyTorch version to >= 2.2. I have tested it, and it successfully reduced the training time from 25 days to 4 days.
Hopefully this works for you.
BTW, I installed the latest PyTorch 2.3.1 with
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
and left the rest of the environment unchanged.
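A quick way to confirm the upgrade took effect and CUDA is still visible (just a sanity check, not part of the fix itself):

```python
import torch

print(torch.__version__)          # expect >= 2.2, e.g. 2.3.1
print(torch.cuda.is_available())  # expect True
```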
Hi, I have updated PyTorch to the latest version and successfully reduced the training time. However, the gradients become NaN after a certain number of iterations and the losses become 0. Did you have this problem during training?
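For reference, a small generic sketch for localizing where the NaNs first appear (plain PyTorch, not project code; the model argument is a placeholder):

```python
import torch

# Makes the backward pass raise at the op that produced the first NaN/Inf.
# It is slow, so enable it only while debugging.
torch.autograd.set_detect_anomaly(True)

def report_bad_grads(model):
    """Call right after loss.backward() to see which parameters blow up."""
    for name, p in model.named_parameters():
        if p.grad is not None and not torch.isfinite(p.grad).all():
            print(f"non-finite gradient in {name}")
```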
I met the same problem. What I tried was changing batch_size from 2 to 4 and lr_rate from 2e-4 to 1e-4; after that the problem was gone and the model trains normally.
Original hyperparameters (batch_size=2, lr_rate=2e-4): log
Modified hyperparameters (batch_size=4, lr_rate=1e-4): log
However, I haven't dug into it in detail since the model hasn't finished training yet, so I can only offer a rough guess: the problem is probably caused by some abnormal data input/GT, and enlarging the batch_size may mitigate its impact.
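Roughly what that change looks like in an mmengine-style config; the key names below are illustrative and this repo's config files may organize them differently, so treat this as a sketch rather than the exact diff:

```python
train_dataloader = dict(
    batch_size=4,  # was 2
    num_workers=4,
)

optim_wrapper = dict(
    type='OptimWrapper',
    optimizer=dict(type='AdamW', lr=1e-4),  # was 2e-4; keep whatever optimizer the config already uses
    # Gradient clipping is another common guard against the NaN gradients mentioned above.
    clip_grad=dict(max_norm=35, norm_type=2),
)
```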
Hi~ Have you finished the training and successfully reproduced the results from the paper?
Hello, did you manage to reproduce the results in the paper?
FYI, I do have some results, though they are not great; they are shown below. Since the model didn't converge well (probably because of the hyperparameter settings and limited GPU resources) and I didn't spend much time on optimizing it or implementing a well-designed visualization script, waiting for the officially released model weights and visualization script would be the ultimate solution.
Hello, may I ask which script you used to train and get this result? I tried all the scripts provided by the author; only the AAAI version runs successfully, and the others all have bugs.