Distributed validation fails

Open MohamedA95 opened this issue 2 years ago • 0 comments

Hello everyone, I am facing an issue with train.py: when --launcher is set to pytorch, the validation step fails with the following stack trace:

  File "test.py", line 208, in <module>
    main()
  File "test.py", line 185, in main
    args.gpu_collect)
  File "/home/blablabla/task3/scrfd/detection/scrfd/mmdet/apis/test.py", line 97, in multi_gpu_test
    result = model(return_loss=False, rescale=True, **data)
  File "/home/blablabla/anaconda3/envs/scrfd/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/blablabla/anaconda3/envs/scrfd/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/blablabla/anaconda3/envs/scrfd/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/blablabla/task3/scrfd/detection/scrfd/mmcv/mmcv/runner/fp16_utils.py", line 84, in new_func
    return old_func(*args, **kwargs)
  File "/home/blablabla/task3/scrfd/detection/scrfd/mmdet/models/detectors/base.py", line 182, in forward
    return self.forward_test(img, img_metas, **kwargs)
  File "/home/blablabla/task3/scrfd/detection/scrfd/mmdet/models/detectors/base.py", line 149, in forward_test
    img_meta[img_id]['batch_input_shape'] = tuple(img.size()[-2:])
TypeError: 'DataContainer' object is not subscriptable

I used test.py for debugging to save training time. After some debugging, I noticed that forward_test() in mmdet/models/detectors/base.py expects img_metas to be a List[List[dict]], but under distributed training it is a List[DataContainer(List[List[dict]])]. See the check below:

        print('\ntype img_metas:', type(img_metas))
        print('type img_metas[0]:', type(img_metas[0]))
        print('type img_metas[0].data:', type(img_metas[0].data))
        print('type img_metas[0].data[0]:', type(img_metas[0].data[0]))
        print('type img_metas[0].data[0][0]:', type(img_metas[0].data[0][0]))
        exit()
# Output:
type img_metas: <class 'list'>
type img_metas[0]: <class 'mmcv.parallel.data_container.DataContainer'>
type img_metas[0].data: <class 'list'>
type img_metas[0].data[0]: <class 'list'>
type img_metas[0].data[0][0]: <class 'dict'>

Relaunching with --launcher set to none works normally. Modifying the code to handle the DataContainer also works, but is there a solution that works for both the distributed and non-distributed flows?
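
For reference, here is a minimal sketch of the kind of modification I mean, written so that it accepts both shapes of img_metas. The helper name unwrap_img_metas is hypothetical, and the DataContainer class below is only a stand-in so the snippet runs without mmcv; in the repository it would be the mmcv.parallel.DataContainer shown in the output above. It assumes one device per process, so only index 0 of .data is used, and it does not touch img.

```python
class DataContainer:
    """Stand-in for mmcv.parallel.DataContainer (assumption: only .data is needed)."""

    def __init__(self, data):
        self._data = data

    @property
    def data(self):
        return self._data


def unwrap_img_metas(img_metas):
    """Return img_metas as List[List[dict]] for either launch mode.

    Hypothetical helper: plain entries pass through unchanged, and
    DataContainer entries are replaced by the list held for device 0.
    """
    unwrapped = []
    for meta in img_metas:
        if isinstance(meta, DataContainer):
            # .data is List[List[dict]]; take the entry for the single
            # device driven by this process (assumption: one GPU per process).
            meta = meta.data[0]
        unwrapped.append(meta)
    return unwrapped


# Non-distributed shape: List[List[dict]]
plain = [[{'filename': 'a.jpg'}]]
# Distributed shape observed above: List[DataContainer(List[List[dict]])]
wrapped = [DataContainer([[{'filename': 'a.jpg'}]])]

print(type(unwrap_img_metas(plain)[0][0]))
print(type(unwrap_img_metas(wrapped)[0][0]))
```

Calling something like this at the top of forward_test() would let the rest of the method index img_meta[img_id] in both cases, but it feels like a workaround rather than a proper fix.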

MohamedA95, May 09 '22 17:05