
why for swin-transformer inference, fp16 is slower than fp32 on Nvidia GPU

Open · yige2018 opened this issue · 4 comments

We recommend using the English template General question so that your question can help more people.

First, confirm the following:

  • I have searched the related issues, but did not find the help I needed.
  • I have read the relevant documentation, but still don't know how to solve the problem.

Describe the problem you encountered

I used the config and checkpoint 'configs/swin_transformer/swin_base_224_b16x64_300e_imagenet.py' and 'swin_base_224_b16x64_300e_imagenet_20210616_190742-93230b0d.pth'. I modified image_demo.py as below:

from argparse import ArgumentParser

from mmcls.apis import inference_model, init_model, show_result_pyplot

import time

from mmcv.runner.fp16_utils import wrap_fp16_model


def main():
    parser = ArgumentParser()
    parser.add_argument('img', help='Image file')
    parser.add_argument('config', help='Config file')
    parser.add_argument('checkpoint', help='Checkpoint file')
    parser.add_argument(
        '--device', default='cuda:0', help='Device used for inference')
    args = parser.parse_args()

    # build the model from a config file and a checkpoint file
    model = init_model(args.config, args.checkpoint, device=args.device)
    wrap_fp16_model(model)

    # warm up, then test a single image repeatedly
    warms = 10
    for i in range(warms):
        result = inference_model(model, args.img)

    start = time.time()
    for i in range(1000):
        result = inference_model(model, args.img)
    end = time.time()

    print('latency is {}'.format(end - start))
    # show the results
    # show_result_pyplot(model, args.img, result)


if __name__ == '__main__':
    main()
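One thing to keep in mind with a timing loop like this: CUDA kernels launch asynchronously, so wall-clock timing is only reliable if the GPU work has finished before the clock is read. inference_model most likely synchronizes implicitly when it copies the prediction back to the CPU, but an explicit torch.cuda.synchronize() makes the measurement robust. A minimal sketch, assuming a callable fn that runs one inference (the helper name time_gpu is hypothetical):

import time
import torch

def time_gpu(fn, n=1000):
    # make sure previously queued GPU work is done before starting the clock
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(n):
        fn()
    # wait for all launched kernels to finish before reading the clock again
    torch.cuda.synchronize()
    return time.time() - start

# usage (hypothetical):
# elapsed = time_gpu(lambda: inference_model(model, args.img))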

In mmcls/apis/inference.py, I added one line:

if next(model.parameters()).is_cuda:
    data['img'] = data['img'].half()  # <- added line
    # scatter to specified GPU
    data = scatter(data, [device])[0]

Below is the result running FP16:

(open-mmlab) :~/mmclassification/demo$ python image_demo.py demo_224.jpg '../configs/swin_transformer/swin_base_224_b16x64_300e_imagenet.py' '../checkpoints/swin_base_224_b16x64_300e_imagenet_20210616_190742-93230b0d.pth' --device cuda
load checkpoint from local path: ../checkpoints/swin_base_224_b16x64_300e_imagenet_20210616_190742-93230b0d.pth
latency is 38.851526737213135

For FP32, I didn't change any code in mmcls/apis/inference.py; in image_demo.py, I just commented out wrap_fp16_model(model).

Here is the log for FP32:

load checkpoint from local path: ../checkpoints/swin_base_224_b16x64_300e_imagenet_20210616_190742-93230b0d.pth
latency is 35.35148549079895

So FP32 turns out faster than FP16. Is anything wrong here? What is a good way to run FP16 inference on this model?
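One alternative worth trying is PyTorch's automatic mixed precision instead of casting the whole model with wrap_fp16_model: autocast runs matmul-heavy ops in FP16 on tensor cores while keeping precision-sensitive ops such as softmax and layer norm in FP32. A minimal sketch, assuming model and a preprocessed data dict prepared the same way inference_model prepares them:

import torch

with torch.no_grad():
    # autocast picks FP16 kernels for matmul-heavy ops and leaves the
    # numerically sensitive ones in FP32; the model weights stay FP32
    with torch.cuda.amp.autocast():
        result = model(return_loss=False, **data)

Note also that at batch size 1 with a 224x224 input the GPU is far from saturated, so kernel-launch and data-movement overhead can dominate and FP16 may show little or no speedup; the gap usually appears at larger batch sizes.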


Related information

  1. The output of pip list | grep "mmcv\|mmcls\|^torch": [fill here]
  2. If you modified the config file or used a new one, please describe it here: [fill here]
  3. If the problem occurred during training, please provide the full training log and error message: [fill here]
  4. If you made any other changes to the code under the mmcls folder, please describe them here: In mmcls/apis/inference.py, I added one line:

    if next(model.parameters()).is_cuda:
        data['img'] = data['img'].half()
        # scatter to specified GPU
        data = scatter(data, [device])[0]

yige2018 · May 10 '22 23:05

@Ezra-Yu It is an A10.

yige2018 · May 11 '22 03:05

Please try to set

torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False

at the beginning of the script and test FP32 again.
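For context: the A10 is an Ampere GPU, and at the time of this thread PyTorch enabled TensorFloat-32 for FP32 matmuls on Ampere cards by default, so the "FP32" numbers above may already be running on tensor cores. A quick sketch to check what the run is actually doing:

import torch

# Ampere cards (A10, A100, RTX 30xx) report compute capability major == 8;
# TF32 only takes effect on these GPUs
print(torch.cuda.get_device_capability(0))    # e.g. (8, 6) on an A10
print(torch.backends.cuda.matmul.allow_tf32)  # True by default up to PyTorch 1.11
print(torch.backends.cudnn.allow_tf32)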

mzr1996 · May 11 '22 03:05

@mzr1996 Thanks! I will definitely try it.

yige2018 · May 11 '22 04:05

@mzr1996 I put them at the beginning of the script, but I don't see any performance change for FP32. Here is the location of these two lines; could you help review whether I am running FP16 correctly in the code above? Here is the partial code:

from argparse import ArgumentParser

from mmcls.apis import inference_model, init_model, show_result_pyplot

import time
import torch

from mmcv.runner.fp16_utils import wrap_fp16_model

torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False
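One quick sanity check, as a sketch: confirm that wrap_fp16_model actually converted the weights. As far as I recall, mmcv keeps normalization layers in FP32 for numerical stability, so seeing both dtypes is expected:

import torch

# print the set of parameter dtypes in the wrapped model; torch.float16
# should dominate, with torch.float32 remaining for the norm layers
print({p.dtype for p in model.parameters()})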

yige2018 · May 11 '22 15:05

This issue will be closed as it is inactive; feel free to re-open it if necessary.

tonysy · Dec 12 '22 15:12