Distributed validation fails
Hello everyone,
I am facing an issue with train.py: when --launcher is set to pytorch, the validation step fails with the following stack trace:
File "test.py", line 208, in <module>
main()
File "test.py", line 185, in main
args.gpu_collect)
File "/home/blablabla/task3/scrfd/detection/scrfd/mmdet/apis/test.py", line 97, in multi_gpu_test
result = model(return_loss=False, rescale=True, **data)
File "/home/blablabla/anaconda3/envs/scrfd/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/blablabla/anaconda3/envs/scrfd/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/home/blablabla/anaconda3/envs/scrfd/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/blablabla/task3/scrfd/detection/scrfd/mmcv/mmcv/runner/fp16_utils.py", line 84, in new_func
return old_func(*args, **kwargs)
File "/home/blablabla/task3/scrfd/detection/scrfd/mmdet/models/detectors/base.py", line 182, in forward
return self.forward_test(img, img_metas, **kwargs)
File "/home/blablabla/task3/scrfd/detection/scrfd/mmdet/models/detectors/base.py", line 149, in forward_test
img_meta[img_id]['batch_input_shape'] = tuple(img.size()[-2:])
TypeError: 'DataContainer' object is not subscriptable
I used test.py for debugging to save on training time. After some debugging, I noticed that forward_test() in mmdet/models/detectors/base.py expects img_metas to be List[List[dict]], but under distributed training it is List[DataContainer(List[List[dict]])]. See the check below:
print('\ntype img_metas:', type(img_metas))
print('type img_metas[0]:', type(img_metas[0]))
print('type img_metas[0].data:', type(img_metas[0].data))
print('type img_metas[0].data[0]:', type(img_metas[0].data[0]))
print('type img_metas[0].data[0][0]:', type(img_metas[0].data[0][0]))
exit()
#Output:
type img_metas: <class 'list'>
type img_metas[0]: <class 'mmcv.parallel.data_container.DataContainer'>
type img_metas[0].data: <class 'list'>
type img_metas[0].data[0]: <class 'list'>
type img_metas[0].data[0][0]: <class 'dict'>
Relaunching with --launcher set to none works normally. Modifying the code to handle the DataContainer also works, but is there a solution that works for both the distributed and non-distributed flows?
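For reference, the workaround I mean looks roughly like the sketch below. It is only a sketch: unwrap_img_metas is a hypothetical helper of my own (not part of mmdet or mmcv), and FakeDataContainer is a stand-in so the snippet runs without mmcv installed. It assumes one GPU per process, so the first chunk of .data is the one that belongs to the current process, which matches the types printed above.

```python
from typing import Any, Dict, List


def unwrap_img_metas(img_metas: List[Any]) -> List[List[Dict]]:
    """Return img_metas as List[List[dict]], unwrapping containers if present.

    Hypothetical helper, not an mmdet/mmcv API.
    """
    unwrapped = []
    for metas in img_metas:
        # Duck-typed check: a DataContainer exposes its payload via `.data`,
        # while a plain list has no such attribute and passes through as-is.
        if hasattr(metas, "data") and not isinstance(metas, (list, tuple)):
            # Assumption: one GPU per process, so take the first chunk.
            metas = metas.data[0]
        unwrapped.append(metas)
    return unwrapped


class FakeDataContainer:
    """Stand-in for mmcv.parallel.DataContainer, for illustration only."""

    def __init__(self, data):
        self.data = data


meta = {"img_shape": (640, 640, 3)}
wrapped = [FakeDataContainer([[meta]])]  # shape seen with --launcher pytorch
plain = [[meta]]                         # shape seen with --launcher none

assert unwrap_img_metas(wrapped) == plain
assert unwrap_img_metas(plain) == plain
```

Calling something like this at the top of forward_test() lets both launch modes go through the same code path, but it feels like patching a symptom, which is why I am asking whether there is a proper fix.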