
CUBLAS_STATUS_ARCH_MISMATCH

HemantTiwariGitHub opened this issue 3 years ago · 3 comments

Description

I am facing the following error:
CUDA runtime error: CUBLAS_STATUS_ARCH_MISMATCH FasterTransformer/src/fastertransformer/utils/cublasMMWrapper.cc:592 

My environment:
GPU: Tesla V100-SXM3
NVIDIA-SMI 450.191.01   
Driver Version: 450.191.01   
CUDA Version: 11.1

Reproduced Steps

1. git clone https://github.com/NVIDIA/FasterTransformer.git
2. mkdir -p FasterTransformer/build
3. cd FasterTransformer/build
4. git submodule init && git submodule update
5. cmake -DSM=70 -DCMAKE_BUILD_TYPE=Release -DBUILD_PYT=ON -DBUILD_MULTI_GPU=ON ..
6. make
7. pip install transformers==2.5.1
8. ./bin/bert_gemm 1 32 12 64 1 0
9. python ../examples/pytorch/bert/bert_example.py 1 12 32 12 64 --data_type fp16 --time

The issue occurs when running step 9.
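As a side note (not part of the original report): `CUBLAS_STATUS_ARCH_MISMATCH` can appear when a binary was built for a compute capability that does not match the device, so it is worth double-checking that the `-DSM` value in step 5 matches the GPU. A minimal sketch of that mapping for a few common data-center GPUs; the helper name and dictionary are hypothetical, but the compute-capability values themselves are the well-known NVIDIA ones:

```python
# Hypothetical helper: map a GPU model name to the compute-capability
# value expected by FasterTransformer's -DSM cmake flag.
SM_BY_GPU = {
    "Tesla V100": 70,  # Volta
    "Tesla T4": 75,    # Turing
    "A100": 80,        # Ampere
}

def sm_for(device_name: str) -> int:
    """Return the -DSM value for a device name, matched by substring."""
    for model, sm in SM_BY_GPU.items():
        if model in device_name:
            return sm
    raise ValueError(f"unknown GPU: {device_name}")

# The reporter's GPU is a Tesla V100-SXM3, so -DSM=70 is correct here.
print(sm_for("Tesla V100-SXM3-32GB"))  # 70
```

Since `-DSM=70` already matches the V100 in this report, the mismatch is more likely in the runtime environment, as discussed below.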

HemantTiwariGitHub avatar Aug 25 '22 09:08 HemantTiwariGitHub

Can you try to run

./bin/bert_example 1 12 32 12 64 1 0

byshiue avatar Aug 25 '22 09:08 byshiue

This works fine and gives output as :

$ ./bin/bert_example 1 12 32 12 64 1 0

[INFO] Device: Tesla V100-SXM3-32GB
Before loading model: free: 31.44 GB, total: 31.75 GB, used: 0.30 GB
[FT][WARNING] Async cudaMalloc/Free is not supported before CUDA 11.2. Using Sync cudaMalloc/Free. Note this may lead to hang with NCCL kernels launched in parallel; if so, try NCCL_LAUNCH_MODE=GROUP
After loading model : free: 30.23 GB, total: 31.75 GB, used: 1.52 GB
After inference : free: 30.23 GB, total: 31.75 GB, used: 1.52 GB
[FT][INFO] batch_size 1 seq_len 32 layer 12 FT-CPP-time 1.05 ms (10 iterations)

Should I compile and use it with C++ instead?

HemantTiwariGitHub avatar Aug 25 '22 10:08 HemantTiwariGitHub


This means the bug you encountered is likely an issue with the PyTorch environment settings. Are you using the PyTorch Docker image we suggest in the documentation? If not, can you try running in Docker first?

byshiue avatar Aug 26 '22 00:08 byshiue

Closing this bug because it is inactive. Feel free to re-open it if you still have any problems.

byshiue avatar Dec 02 '22 14:12 byshiue