mmdetection3d
[Bug] nuScenes dataset evaluation times out
Prerequisite
- [X] I have searched Issues and Discussions but cannot get the expected help.
- [X] I have read the FAQ documentation but cannot get the expected help.
- [X] The bug has not been fixed in the latest version (dev-1.x) or latest version (dev-1.0).
Task
I'm using the official example scripts/configs for the officially supported tasks/models/datasets.
Branch
main branch https://github.com/open-mmlab/mmdetection3d
Environment
- mmcv 2.0.0rc4
- mmdet 3.0.0rc5
- mmdet3d 1.3.0
- mmengine 0.9.1
Reproduces the problem - code sample
None
Reproduces the problem - command or script
CUDA_VISIBLE_DEVICES=0,2 bash tools/dist_test.sh config.py checkpoint.pth 2
Reproduces the problem - error message
Formating bboxes of pred_instances_3d
Start to convert detection format...
[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 6019/6019, 4.3 task/s, elapsed: 1390s, ETA: 0s
Results writes to /tmp/tmpspsqlq2j/results/pred_instances_3d/results_nusc.json
Evaluating bboxes of pred_instances_3d
46%|███████████████████████████████████████████████████████▋ | 2749/6019 [00:07<00:09, 333.98it/s]
[E ProcessGroupNCCL.cpp:566] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1804112 milliseconds before timing out.
Traceback (most recent call last):
File "tools/test.py", line 149, in
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 3/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
Additional information
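3D object detection evaluation on the nuScenes v1.0-trainval dataset runs into a timeout.
My understanding is that the GPUs remain occupied after inference finishes, while the evaluation computation is so slow that it exceeds the 1800-second limit, which triggers the timeout error. Could the resource handling be optimized so that GPU resources are released once inference ends and the evaluation runs on the CPU, avoiding the abnormal exit on timeout?
Current workaround: manually increase the timeout with dist_cfg=dict(backend='nccl', timeout=3600).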
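For anyone applying the same workaround, here is a minimal sketch of where the setting might go. It assumes the usual MMEngine layout, in which `dist_cfg` is nested under `env_cfg` and `timeout` is given in seconds; the `cudnn_benchmark` and `mp_cfg` values are illustrative defaults, not taken from this report, so keep whatever your own config already uses.

```python
# Sketch of a config override (assumed layout: dist_cfg nested under env_cfg).
# Place it in the config file passed to tools/dist_test.sh so that it takes
# precedence over the inherited default runtime settings.
env_cfg = dict(
    cudnn_benchmark=False,  # illustrative default, not from the report
    mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),  # illustrative
    # Raise the collective timeout from the 1800 s seen in the log to 3600 s.
    dist_cfg=dict(backend='nccl', timeout=3600),
)
```

If the value appears to have no effect, check that the file you edited is the one actually loaded, and that a base config is not supplying its own `env_cfg` in its place.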
Should this be added directly to the model's config file?
Yes.
OK, thanks. I just found that my earlier change was being overridden by the default file, which is why the setting did not take effect.