FP16 inference of the PyTorch swintransformer op produces `nan` output.
Description
System and software:
- FasterTransformer version: v5.0
- GPU: T4
- Swin-Transformer: e0486b2cf8c63b6314570a43007569c8aa9b4578
- CUDA: 11.0
Error Message
- got `nan` from the FP16 inference of swintransformer_op: `FP16_torch_traced_output vs FP16_op_output , avg diff : nan max diff : nan`, which is caused by `nan` values in the output of `FP16_op_output` (found after debugging into `infer_swintransformer_op.py`);
- got a large number of CUDA error messages during FP16 op inference:
CUDA Error: (null) /workdir/xxx/packages/v5.0_tag/FasterTransformer-release-v5.0_tag/3rdparty/trt_fused_multihead_attention/fused_multihead_attention_v2.h 682
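The `(null)` in this message is itself a clue: when the loaded libcuda.so cannot resolve an error code to a name, `cuGetErrorName` sets its output pointer to NULL, and glibc's printf renders a NULL `%s` argument as `(null)`. Below is a minimal sketch of that style of driver-API check; `checkCu`/`CHECK_CU` are hypothetical names for illustration, not the actual code in fused_multihead_attention_v2.h:

```cpp
#include <cuda.h>
#include <cstdio>

// Sketch of a driver-API error check that prints messages in the
// "CUDA Error: <name> <file> <line>" format seen above. Illustration
// only; not the actual FasterTransformer source.
static void checkCu(CUresult status, const char* file, int line)
{
    if (status != CUDA_SUCCESS) {
        const char* name = nullptr;
        // For an error code the loaded libcuda.so does not recognize,
        // cuGetErrorName sets `name` to NULL; glibc's printf then
        // renders the NULL %s argument as "(null)".
        cuGetErrorName(status, &name);
        fprintf(stderr, "CUDA Error: %s %s %d\n", name, file, line);
    }
}

#define CHECK_CU(call) checkCu((call), __FILE__, __LINE__)

int main()
{
    // 12345 is not a valid CUresult value, so the name lookup fails and
    // the message comes out as "CUDA Error: (null) <file> <line>".
    checkCu(static_cast<CUresult>(12345), __FILE__, __LINE__);
    return 0;
}
```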
Reproduced Steps
1. Clone Swin-Transformer:
cd examples/pytorch/swin/Swin-Transformer-Quantization && \
git clone https://github.com/microsoft/Swin-Transformer.git && \
cd Swin-Transformer && \
git checkout e0486b2cf8c63b6314570a43007569c8aa9b4578 && \
cd ../..
2. Modify the file paths:
*inference_swintransformer_op.py*
25 import sys
26 sys.path.insert(0, "./Swin-Transformer-Quantization/Swin-Transformer")
*run_test.sh*
12 python3 infer_swintransformer_op.py \
13 --eval \
14 --data-path /workspace \
15 --cfg Swin-Transformer-Quantization/Swin-Transformer/configs/swin/swin_tiny_patch4_window7_224.yaml \
16 --resume Swin-Transformer-Quantization/swin_tiny_patch4_window7_224.pth \
17 --th-path ../../../build/lib/libpyt_swintransformer.so \
18 --batch-size $1
3. Run the test shell script: bash run_test.sh 1
It seems that WindowAttention invokes FusedMHARunnerFP16v2 incorrectly; we get the expected difference between FP16_op_output and FP16_torch_traced_output after forbidding the use of fused attention:
src/fastertransformer/layers/attention_layers/WindowAttention.cc
183    if ((sm == 75 || sm == 80 || sm == 86) && size_per_head == 32 && window_len_ <= TRT_MAX_LEN
184        && std::is_same<T, half>::value) {
185        trt_S = trt_getS(window_len_);
186        // use_trt_ = true;    // commented out to disable fused MHA
187    }
The output of run_test.sh:
FP16 op time : 2.071075439453125 ms
FP16 torch trace time : 8.698122501373291 ms
FP16 torch time : 11.557936668395996 ms
FP32_torch_traced_output vs FP32_op_output , avg diff : 0.00066683866 max diff : 0.0028375983
FP16_torch_traced_output vs FP16_op_output , avg diff : 0.0006795 max diff : 0.003906
CUDA Error: (null) /workdir/xxx/packages/v5.0_tag/FasterTransformer-release-v5.0_tag/3rdparty/trt_fused_multihead_attention/fused_multihead_attention_v2.h 682
This error means that the fused MHA was not invoked successfully. Could you provide the Docker image you use and the steps you followed to build the project?
Thanks for your reply. It feels like I was using the wrong environment: I checked the swin-transformer op inference in the officially recommended image and got the correct result. I built FasterTransformer in my own Docker image, so I will check the build process and the relevant software dependencies.
I found that the main difference between nvcr.io/nvidia/pytorch:21.07-py3 and my own image is the CUDA version: 11.4 in nvcr.io/nvidia/pytorch:21.07-py3 but 11.0 in my image. After switching my image to CUDA 11.4, I got the expected result.
I wanted to double-check inside nvcr.io/nvidia/pytorch:21.07-py3, but the official image is not backward compatible, since its PyTorch depends on CUDA 11.4 (the only thing I did was relink /usr/local/cuda from 11.4 to 11.0 inside nvcr.io/nvidia/pytorch:21.07-py3):
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/opt/conda/lib/python3.8/site-packages/torch/__init__.py", line 197, in <module>
from torch._C import * # noqa: F403
ImportError: libcupti.so.11.4: cannot open shared object file: No such file or directory
CMake Error at CMakeLists.txt:193 (message):
PyTorch >= 1.5.0 is needed for TorchScript mode.
Would you mind checking CUDA 11.0, which the Swin doc declares to be supported:
Requirements
- CMake >= 3.13 for PyTorch
- CUDA 11.0 or newer version
- NCCL 2.10 or newer version
- Python 3 is recommended because some features are not supported in Python 2
- PyTorch: verified on 1.10.0; >= 1.5.0 should work.
I believe CUDA 11.0 is runnable. I tried to build the cpp example in nvcr.io/nvidia/pytorch:20.07-py3, which contains CUDA 11.0. I can run the cpp example successfully with the following script:
cmake -DSM=75 -DCMAKE_BUILD_TYPE=Release ..
make -j
./bin/swin_example 1 0 32
Note that you need to disable the BF16 flag because the cuDNN version in that image is too old and does not support BF16. The CMake version in that image is also too old, so you need to upgrade CMake as well.
Thanks, I will try it.
@byshiue, it turns out FT was linking the wrong CUDA library in my Docker image: it linked against the stub libcuda.so from /usr/local/cuda/lib64/stubs/libcuda.so.
I debugged into the following location to check the error code:
CUDA Error: (null) /workdir/xxx/packages/v5.0_tag/FasterTransformer-release-v5.0_tag/3rdparty/trt_fused_multihead_attention/fused_multihead_attention_v2.h 682
The error code is `CUDA_ERROR_STUB_LIBRARY`, which means:
This indicates that the CUDA driver that the application has loaded is a stub library. Applications that run with the stub rather than a real driver loaded will result in CUDA API returning this error.
After I reset the link path, everything is OK!
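For anyone who hits the same message: with the stub libcuda.so loaded, the very first driver call fails, so a tiny standalone probe makes the failure mode obvious. Here is a sketch (my own code, not part of FasterTransformer):

```cpp
#include <cuda.h>
#include <cstdio>

// Standalone probe: run it inside the container to see whether the
// dynamic loader picked up the real driver or the stub libcuda.so
// from /usr/local/cuda/lib64/stubs. Sketch only, not FT code.
int main()
{
    CUresult status = cuInit(0);
    if (status != CUDA_SUCCESS) {
        const char* name = nullptr;
        cuGetErrorName(status, &name);
        // With the stub loaded this reports CUDA_ERROR_STUB_LIBRARY,
        // or "(null)" if the headers are too old to name that code.
        fprintf(stderr, "cuInit failed: %s (code %d)\n",
                name ? name : "(null)", static_cast<int>(status));
        return 1;
    }
    printf("real CUDA driver loaded, cuInit OK\n");
    return 0;
}
```

Building this with `g++ probe.cpp -lcuda` and checking the binary with `ldd` also shows directly which libcuda.so path the loader resolves.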