CUBLAS_STATUS_ARCH_MISMATCH
Description
I am facing the following error:
CUDA runtime error: CUBLAS_STATUS_ARCH_MISMATCH FasterTransformer/src/fastertransformer/utils/cublasMMWrapper.cc:592
My environment:
GPU: Tesla V100-SXM3
NVIDIA-SMI 450.191.01
Driver Version: 450.191.01
CUDA Version: 11.1
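(For reference, CUBLAS_STATUS_ARCH_MISMATCH is cuBLAS reporting that a routine was asked for a feature the device architecture does not support. As a hedged sketch, a quick sanity check is to confirm that the device really reports compute capability 7.0, which is what the -DSM=70 build flag below targets; the file name and exit codes here are illustrative, not part of FasterTransformer:)

// check_arch.cu -- hypothetical helper, not part of FasterTransformer.
// Prints the compute capability of device 0 and checks it against SM 70.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, /*device=*/0);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                     cudaGetErrorString(err));
        return 2;
    }
    std::printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    // The build below uses -DSM=70, so a V100 should report 7.0 here.
    return (prop.major == 7 && prop.minor == 0) ? 0 : 1;
}

Compile with: nvcc check_arch.cu -o check_arch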
Reproduced Steps
1. git clone https://github.com/NVIDIA/FasterTransformer.git
2. mkdir -p FasterTransformer/build
3. cd FasterTransformer/build
4. git submodule init && git submodule update
5. cmake -DSM=70 -DCMAKE_BUILD_TYPE=Release -DBUILD_PYT=ON -DBUILD_MULTI_GPU=ON ..
6. make
7. pip install transformers==2.5.1
8. ./bin/bert_gemm 1 32 12 64 1 0
9. python ../examples/pytorch/bert/bert_example.py 1 12 32 12 64 --data_type fp16 --time
The issue happens during execution of step 9 (the fp16 PyTorch run).
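For context, the failing line in cublasMMWrapper.cc is inside FasterTransformer's wrapper around cuBLAS GEMM calls. Below is a hedged, self-contained sketch (not FasterTransformer's actual code) of an FP16 cublasGemmEx call of the same general shape; when cuBLAS does not support the requested architecture/type combination, a call like this reports CUBLAS_STATUS_ARCH_MISMATCH through its returned status:

// gemm_fp16.cu -- illustrative sketch only, not FasterTransformer code.
#include <cstdio>
#include <cublas_v2.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>

int main() {
    const int n = 64;  // small square GEMM; data is left uninitialized
    half *A, *B, *C;   // because only the returned status matters here
    cudaMalloc(&A, n * n * sizeof(half));
    cudaMalloc(&B, n * n * sizeof(half));
    cudaMalloc(&C, n * n * sizeof(half));

    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f, beta = 0.0f;
    // FP16 inputs/outputs with FP32 accumulation (Tensor Cores on Volta).
    // On an unsupported architecture/type combination, cuBLAS returns
    // CUBLAS_STATUS_ARCH_MISMATCH here instead of CUBLAS_STATUS_SUCCESS.
    cublasStatus_t st = cublasGemmEx(
        handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
        &alpha,
        A, CUDA_R_16F, n,
        B, CUDA_R_16F, n,
        &beta,
        C, CUDA_R_16F, n,
        CUBLAS_COMPUTE_32F,      // CUDA 11+ compute-type enum
        CUBLAS_GEMM_DEFAULT);
    std::printf("cublasGemmEx status: %d (0 == CUBLAS_STATUS_SUCCESS)\n",
                static_cast<int>(st));

    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return st == CUBLAS_STATUS_SUCCESS ? 0 : 1;
}

Compile with something like: nvcc gemm_fp16.cu -lcublas -o gemm_fp16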
Can you try running:
./bin/bert_example 1 12 32 12 64 1 0
This works fine and gives the following output:
$ ./bin/bert_example 1 12 32 12 64 1 0
[INFO] Device: Tesla V100-SXM3-32GB
Before loading model: free: 31.44 GB, total: 31.75 GB, used: 0.30 GB
[FT][WARNING] Async cudaMalloc/Free is not supported before CUDA 11.2. Using Sync cudaMalloc/Free. Note this may lead to hang with NCCL kernels launched in parallel; if so, try NCCL_LAUNCH_MODE=GROUP
After loading model : free: 30.23 GB, total: 31.75 GB, used: 1.52 GB
After inference : free: 30.23 GB, total: 31.75 GB, used: 1.52 GB
[FT][INFO] batch_size 1 seq_len 32 layer 12 FT-CPP-time 1.05 ms (10 iterations)
Should I compile and use the C++ version instead?
This means the bug you encountered is likely an issue with your PyTorch environment setup. Are you using the PyTorch Docker image we suggest in the documentation? If not, can you try running in Docker first?
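For reference, a container can be started with something along these lines; the image tag is only an example (use whichever NGC PyTorch tag the FasterTransformer documentation pins, matched to your driver), and --gpus all assumes Docker 19.03+ with the NVIDIA container toolkit installed:

$ docker run --gpus all -it --rm -v $(pwd):/workspace/FasterTransformer nvcr.io/nvidia/pytorch:20.12-py3

Then rebuild and rerun the reproduction steps inside the container.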
Closing this bug because it is inactive. Feel free to re-open it if you still have any problems.