LLVM ERROR for benchmark_generation_mamba_simple.py
I tried to run the following command:
python benchmarks/benchmark_generation_mamba_simple.py --model-name "state-spaces/mamba-130m" --prompt "My cat wrote all this CUDA code for a new language model and" --topp 0.9 --temperature 0.7 --repetition-penalty 1.2
The error occurs at the out = fn() step:
LLVM ERROR: Cannot select: intrinsic %llvm.nvvm.shfl.sync.bfly.i32
Aborted (core dumped)
I tried different versions of CUDA, but that didn't solve the problem.
Identical error for me, though running:
python evals/lm_harness_eval.py --model mamba --model_args pretrained=state-spaces/mamba-2.8b-slimpj --tasks arc_easy --device cuda --batch_size 64
Versions:
$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Nov_22_10:17:15_PST_2023
Cuda compilation tools, release 12.3, V12.3.107
Build cuda_12.3.r12.3/compiler.33567101_0
$ llvm-config --version
14.0.0
$ nvidia-smi
NVIDIA-SMI 545.23.08 Driver Version: 545.23.08 CUDA Version: 12.3
GPU:
NVIDIA GeForce GTX 1080 Ti
Update
I bought an NVIDIA GeForce RTX 3080 Ti GPU, and now it just works.
Same issue here.
My environment setup:
conda create -n blackmamba python=3.8
conda activate blackmamba
pip3 install packaging
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121
git clone https://github.com/Dao-AILab/causal-conv1d.git
cd causal-conv1d
git checkout v1.1.0
CAUSAL_CONV1D_FORCE_BUILD=TRUE pip install .
cd ..
pip3 install .
Hello, I have no experience with LLVM, but just messing around I stumbled across the same issue.
Loading model state-spaces/mamba-130m
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Number of parameters: 129135360
LLVM ERROR: Cannot select: intrinsic %llvm.nvvm.shfl.sync.bfly.i32
Aborted
I am wondering if you could comment on what environment you are using (regular Linux, or WSL like me?).
I am on a Windows 10 WSL-Ubuntu environment. I am able to run the CUDA examples perfectly fine (I've also tested on runtime version 12.3). I also have a matching cuDNN for 11.8 installed; everything is installed for 11.8.
CUDA Driver = CUDART, CUDA Driver Version = 12.4, CUDA Runtime Version = 11.8, NumDevs = 1, Device0 = NVIDIA GeForce GTX 1080 Ti
Result = PASS
NVIDIA-SMI 550.54.10 Driver Version: 551.61 CUDA Version: 12.4
Linux version 5.15.133.1-microsoft-standard-WSL2 (root@) (gcc (GCC) 11.2.0, GNU ld (GNU Binutils) 2.37)
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
I have switched between Python versions, updated drivers, rolled back drivers, and uninstalled and reinstalled different versions of CUDA. Still the same issue every time.
I should also mention I am not running this in a Docker container or through conda, just straight through the terminal. Like I said, I have no clue what I am doing 😅
Update
I have found that my GTX 1080 Ti does not support the architecture needed to run the shfl.sync.bfly intrinsic :<
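A quick way to confirm this on your own machine, assuming a working PyTorch CUDA install, is to query the device's compute capability:

import torch

# Pascal cards (GTX 10xx series) report compute capability 6.x,
# below the 7.0 that the Triton kernels used by mamba_ssm require.
major, minor = torch.cuda.get_device_capability(0)
print(f"compute capability: {major}.{minor}")  # GTX 1080 Ti -> 6.1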
I found the cause: the GPU does not support Triton; the GPU's compute capability must be 7.0 or higher. So the solution may be to change the GPU, or to change these two pieces of code to fall back to torch (a sketch follows below):
https://github.com/state-spaces/mamba/blob/28b1435eb56c3082a243d23253ee7676ad737c09/mamba_ssm/ops/triton/layernorm.py#L64-L65
https://github.com/state-spaces/mamba/blob/28b1435eb56c3082a243d23253ee7676ad737c09/mamba_ssm/ops/triton/selective_state_update.py#L20-L21
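As a minimal sketch of that fallback (untested on my side; the import names mirror the two files linked above, and I'm assuming the pure-PyTorch paths cover the model configs you need), you could gate the Triton imports on compute capability:

import torch

def triton_supported() -> bool:
    # Triton's GPU backend needs compute capability >= 7.0;
    # on older cards it emits shfl.sync.bfly and LLVM aborts.
    major, _ = torch.cuda.get_device_capability(0)
    return major >= 7

if triton_supported():
    from mamba_ssm.ops.triton.layernorm import RMSNorm, layer_norm_fn, rms_norm_fn
    from mamba_ssm.ops.triton.selective_state_update import selective_state_update
else:
    # mamba_ssm already guards these imports with try/except and takes
    # slower pure-PyTorch code paths when they are None.
    RMSNorm = layer_norm_fn = rms_norm_fn = None
    selective_state_update = None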
Hi, I have the same issue. Could you tell me where you found that the GTX 1080 Ti does not support the architecture to run shfl.sync.bfly?
Hi, I have essentially the same issue with my 1070 Ti. Did you find the reason, or a way to bypass this?
Triton caused the problem. I finally solved it in my environment (my graphics card is a 1050 Ti). You can follow the compilation steps in issue #40 to install. Then uninstall triton and install triton-nightly >= 3.0.0.post20240626041721.
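For anyone else on an older card, the steps were roughly as follows. (The triton-nightly wheels were published on a separate package index rather than PyPI at the time; the index URL below is my assumption from Triton's docs of that period and may have moved since.)

pip uninstall -y triton
pip install --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/Triton-Nightly/pypi/simple/ "triton-nightly>=3.0.0.post20240626041721"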