
[Bug] nuScenes dataset evaluation times out

Open AndrewJSong opened this issue 1 year ago • 3 comments

Prerequisite

Task

I'm using the official example scripts/configs for the officially supported tasks/models/datasets.

Branch

main branch https://github.com/open-mmlab/mmdetection3d

Environment

mmcv 2.0.0rc4
mmdet 3.0.0rc5
mmdet3d 1.3.0
mmengine 0.9.1

Reproduces the problem - code sample

None

Reproduces the problem - command or script

CUDA_VISIBLE_DEVICES=0,2 bash tools/dist_test.sh config.py checkpoint.pth 2

Reproduces the problem - error message

Formating bboxes of pred_instances_3d
Start to convert detection format...
[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 6019/6019, 4.3 task/s, elapsed: 1390s, ETA: 0s
Results writes to /tmp/tmpspsqlq2j/results/pred_instances_3d/results_nusc.json
Evaluating bboxes of pred_instances_3d
46%|███████████████████████████████████████████████████████▋ | 2749/6019 [00:07<00:09, 333.98it/s]
[E ProcessGroupNCCL.cpp:566] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1804112 milliseconds before timing out.
Traceback (most recent call last):
  File "tools/test.py", line 149, in <module>
    main()
  File "tools/test.py", line 145, in main
    runner.test()
  File "/opt/conda/lib/python3.7/site-packages/mmengine/runner/runner.py", line 1823, in test
    metrics = self.test_loop.run()  # type: ignore
  File "/opt/conda/lib/python3.7/site-packages/mmengine/runner/loops.py", line 438, in run
    metrics = self.evaluator.evaluate(len(self.dataloader.dataset))
  File "/opt/conda/lib/python3.7/site-packages/mmengine/evaluator/evaluator.py", line 79, in evaluate
    _results = metric.evaluate(size)
  File "/opt/conda/lib/python3.7/site-packages/mmengine/evaluator/metric.py", line 144, in evaluate
    broadcast_object_list(metrics)
  File "/opt/conda/lib/python3.7/site-packages/mmengine/dist/dist.py", line 519, in broadcast_object_list
    torch_dist.broadcast_object_list(data, src, group)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1733, in broadcast_object_list
    broadcast(object_tensor, src=src, group=group)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1076, in broadcast
    work = default_pg.broadcast([tensor], opts)
RuntimeError: NCCL communicator was aborted on rank 1.
52%|███████████████████████████████████████████████████████████████▋ | 3145/6019 [00:08<00:07, 381.26it/s]
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what(): [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1804112 milliseconds before timing out.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 1 (pid: 11079) of binary: /opt/conda/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 3/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:

Additional information

3D object detection evaluation on the nuScenes v1.0-trainval dataset fails with a timeout.

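My understanding is that the processes keep holding the GPUs after inference finishes, while the evaluation itself is slow enough to exceed the 1800-second limit, which triggers the timeout error. Could the resource handling be optimized so that the GPUs are released once inference ends and the evaluation runs on the CPU, avoiding the abnormal exit caused by the timeout?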

Current workaround: manually increase the timeout by setting dist_cfg=dict(backend='nccl',timeout=3600).
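For readers applying the workaround, here is a minimal sketch of where the setting might live, assuming the config follows the usual MMEngine layout in which `dist_cfg` is nested under `env_cfg`, and assuming MMEngine interprets an integer `timeout` as seconds. The surrounding `env_cfg` key and the comparison against the default are illustrative additions, not taken from the original report.

```python
from datetime import timedelta

# Assumed placement: in the experiment config, dist_cfg sits under env_cfg.
# Other env_cfg keys inherited from a base config are left untouched here.
env_cfg = dict(
    dist_cfg=dict(backend='nccl', timeout=3600),  # seconds (assumption)
)

# The log reports Timeout(ms)=1800000, i.e. a 30-minute collective timeout.
default_timeout = timedelta(milliseconds=1_800_000)
new_timeout = timedelta(seconds=env_cfg['dist_cfg']['timeout'])

print(default_timeout)  # 0:30:00
print(new_timeout)      # 1:00:00
```

Doubling the limit only helps if the evaluation finishes within one hour on rank 0; a slower machine may need a larger value.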

AndrewJSong • Nov 16 '23 01:11

Should this be added directly to the model's config file?

Hgsil • Dec 06 '23 03:12

Yes, add it directly to the model's config file.

AndrewJSong • Dec 06 '23 04:12

OK, thanks. I just found that my earlier change was being overwritten by the default file, so the setting never took effect.

Hgsil • Dec 06 '23 08:12
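Since the last comment describes an override that was silently lost, a quick check of the merged config may help. The sketch below is a hypothetical helper working on a plain dictionary. It assumes the final config has already been converted to a dict (for example by loading it with MMEngine and dumping it), and it assumes the 1800-second default seen in the log above.

```python
def effective_dist_timeout(cfg: dict, default_seconds: int = 1800) -> int:
    """Return the distributed timeout a merged config would request.

    `cfg` is assumed to be the fully merged config as a plain dict.
    Falls back to `default_seconds` when no timeout is set.
    """
    dist_cfg = cfg.get('env_cfg', {}).get('dist_cfg', {})
    return dist_cfg.get('timeout', default_seconds)


# Illustrative inputs: one config where the override survived merging,
# and one where a default file replaced dist_cfg and dropped the timeout.
overridden = {'env_cfg': {'dist_cfg': {'backend': 'nccl', 'timeout': 3600}}}
clobbered = {'env_cfg': {'dist_cfg': {'backend': 'nccl'}}}

print(effective_dist_timeout(overridden))  # 3600
print(effective_dist_timeout(clobbered))   # 1800
```

If the second case shows up for a real config, the override is probably defined in a file that a later base or default file replaces.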