LLVM ERROR for benchmark_generation_mamba_simple.py
I tried to run the following command:
python benchmarks/benchmark_generation_mamba_simple.py --model-name "state-spaces/mamba-130m" --prompt "My cat wrote all this CUDA code for a new language model and" --topp 0.9 --temperature 0.7 --repetition-penalty 1.2
The error occurs at the out = fn() step:
LLVM ERROR: Cannot select: intrinsic %llvm.nvvm.shfl.sync.bfly.i32
Aborted (core dumped)
I tried different versions of CUDA, but that didn't solve the problem.
Identical error for me, though running:
python evals/lm_harness_eval.py --model mamba --model_args pretrained=state-spaces/mamba-2.8b-slimpj --tasks arc_easy --device cuda --batch_size 64
Versions:
$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Nov_22_10:17:15_PST_2023
Cuda compilation tools, release 12.3, V12.3.107
Build cuda_12.3.r12.3/compiler.33567101_0
$ llvm-config --version
14.0.0
$ nvidia-smi
NVIDIA-SMI 545.23.08 Driver Version: 545.23.08 CUDA Version: 12.3
GPU:
NVIDIA GeForce GTX 1080 Ti
Update
I bought an NVIDIA GeForce RTX 3080 Ti GPU, and now it just works.
Same issue here.
My environment setup:
conda create -n blackmamba python=3.8
conda activate blackmamba
pip3 install packaging
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121
git clone https://github.com/Dao-AILab/causal-conv1d.git
cd causal-conv1d
git checkout v1.1.0
CAUSAL_CONV1D_FORCE_BUILD=TRUE pip install .
cd ..
pip3 install .
Hello, I have no experience with LLVM, but just messing around I stumbled across the same issue.
Loading model state-spaces/mamba-130m
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Number of parameters: 129135360
LLVM ERROR: Cannot select: intrinsic %llvm.nvvm.shfl.sync.bfly.i32
Aborted
I am wondering if you could comment on what environment you are using (regular Linux, or WSL like me?).
I am on a Windows 10 WSL-Ubuntu environment. I am able to run the CUDA examples perfectly fine (I've also tested on runtime version 12.3). I also have a matching cuDNN for 11.8 installed; everything is installed for 11.8.
CUDA Driver = CUDART, CUDA Driver Version = 12.4, CUDA Runtime Version = 11.8, NumDevs = 1, Device0 = NVIDIA GeForce GTX 1080 Ti
Result = PASS
NVIDIA-SMI 550.54.10 Driver Version: 551.61 CUDA Version: 12.4
Linux version 5.15.133.1-microsoft-standard-WSL2 (root@) (gcc (GCC) 11.2.0, GNU ld (GNU Binutils) 2.37)
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
I have switched between Python versions, updated drivers, rolled back drivers, and uninstalled and reinstalled different versions of CUDA. Still the same issue every time.
I should also mention I am not running this in a Docker container or through conda, just straight through the terminal. Like I said, I have no clue what I am doing 😅
Update
I have found that my GTX 1080 Ti does not support the architecture needed to run the shfl.sync.bfly intrinsic :<
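A quick way to confirm this on your own machine, assuming a working PyTorch CUDA install, is to query the device's compute capability:

import torch

# Pascal cards (GTX 10xx series) report compute capability 6.x,
# below the 7.0 that the Triton kernels used by mamba_ssm require.
major, minor = torch.cuda.get_device_capability(0)
print(f"compute capability: {major}.{minor}")  # GTX 1080 Ti -> 6.1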
I found the cause: the GPU does not support Triton; the GPU's compute capability must be 7.0 or higher. So the solution may be to change the GPU, or to change these two pieces of code to fall back to torch (a sketch follows below):
https://github.com/state-spaces/mamba/blob/28b1435eb56c3082a243d23253ee7676ad737c09/mamba_ssm/ops/triton/layernorm.py#L64-L65
https://github.com/state-spaces/mamba/blob/28b1435eb56c3082a243d23253ee7676ad737c09/mamba_ssm/ops/triton/selective_state_update.py#L20-L21
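As a minimal sketch of that fallback (untested on my side; the import names mirror the two files linked above, and I'm assuming the pure-PyTorch paths cover the model configs you need), you could gate the Triton imports on compute capability:

import torch

def triton_supported() -> bool:
    # Triton's GPU backend needs compute capability >= 7.0;
    # on older cards it emits shfl.sync.bfly and LLVM aborts.
    major, _ = torch.cuda.get_device_capability(0)
    return major >= 7

if triton_supported():
    from mamba_ssm.ops.triton.layernorm import RMSNorm, layer_norm_fn, rms_norm_fn
    from mamba_ssm.ops.triton.selective_state_update import selective_state_update
else:
    # mamba_ssm already guards these imports with try/except and takes
    # slower pure-PyTorch code paths when they are None.
    RMSNorm = layer_norm_fn = rms_norm_fn = None
    selective_state_update = None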
Hi, I have the same issue. Could you tell me where you found that the GTX 1080 Ti does not support the architecture to run shfl.sync.bfly?
Hi, I have essentially the same issue with my 1070 Ti. Did you find the reason, or a way to bypass this?
Triton caused the problem. I finally solved it in my environment (my graphics card is a 1050 Ti). You can follow the compilation steps in issue #40 to install. Then uninstall triton and install triton-nightly >= 3.0.0.post20240626041721.
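For anyone else on an older card, the steps were roughly as follows. (The triton-nightly wheels were published on a separate package index rather than PyPI at the time; the index URL below is my assumption from Triton's docs of that period and may have moved since.)

pip uninstall -y triton
pip install --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/Triton-Nightly/pypi/simple/ "triton-nightly>=3.0.0.post20240626041721"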