RuntimeError: CUDA error: no kernel image is available for execution on the device
is a p5200 enough for this?
Traceback (most recent call last):
  File "/home/user/mamba/simplermambassm.py", line 259, in ...
  [rest of traceback truncated]
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
@tridao (I am not sure if this is just a hack, but for us old guys with CCC < 7, can we do this?)
I see that the Quadro P5200 has CUDA Compute Capability 6.1. I saw the same error with my GeForce GTX 1070 (also Compute Capability 6.1).
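(You can confirm a card's compute capability directly from PyTorch:)
import torch
# prints the CUDA compute capability of the current device, e.g. (6, 1)
print(torch.cuda.get_device_capability())
print(torch.cuda.get_device_name())  # e.g. 'Quadro P5200'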
I was able to fix it by compiling the causal-conv1d dependency from source, as follows:
git clone https://github.com/Dao-AILab/causal-conv1d.git
cd causal-conv1d
# check out the latest version that Mamba supports:
git checkout v1.0.2
# edit setup.py to add these lines:
cc_flag.append("-gencode")
cc_flag.append("arch=compute_60,code=sm_60")
Here is where you need to add those lines.
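For reference, the patched region of setup.py ends up looking roughly like this; the surrounding lines are from memory and may differ between versions:
# in causal-conv1d's setup.py, where the -gencode flags are assembled:
cc_flag.append("-gencode")
cc_flag.append("arch=compute_60,code=sm_60")  # added: Pascal (P5200 / GTX 1070 / P40)
cc_flag.append("-gencode")
cc_flag.append("arch=compute_70,code=sm_70")  # existing entries continue unchanged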
Then, compile it from source with:
CAUSAL_CONV1D_FORCE_BUILD=TRUE pip install .
You can use the following script to test whether it is working properly:
import torch
from causal_conv1d import causal_conv1d_fn

# arbitrary small shapes; the only goal is to get the CUDA kernel to launch
batch, dim, seq, width = 10, 5, 17, 4
x = torch.zeros((batch, dim, seq)).to('cuda')
weight = torch.zeros((dim, width)).to('cuda')
bias = torch.zeros((dim, )).to('cuda')
causal_conv1d_fn(x, weight, bias, None)  # the trailing None is the activation
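If the extension was built without your GPU's architecture, this call is exactly where the "no kernel image is available" RuntimeError fires, so it doubles as a minimal reproducer.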
EDIT: Just realized the Mamba repo also assumes CCC >= 7. So, I did a similar edit to the mamba setup.py and compiled it with:
MAMBA_FORCE_BUILD=TRUE pip install .
(This takes about 10 minutes to compile)
After doing this, the top-level Mamba demo works:
import torch
from mamba_ssm import Mamba

batch, length, dim = 2, 64, 16
x = torch.randn(batch, length, dim).to("cuda")
model = Mamba(
    # This module uses roughly 3 * expand * d_model^2 parameters
    d_model=dim,  # Model dimension d_model
    d_state=16,   # SSM state expansion factor
    d_conv=4,     # Local convolution width
    expand=2,     # Block expansion factor
).to("cuda")
y = model(x)
assert y.shape == x.shape
>>> y = model(x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/user/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/miniconda3/envs/textgen/lib/python3.10/site-packages/mamba_ssm/modules/mamba_simple.py", line 149, in forward
    out = mamba_inner_fn(
  File "/home/user/miniconda3/envs/textgen/lib/python3.10/site-packages/mamba_ssm/ops/selective_scan_interface.py", line 306, in mamba_inner_fn
    return MambaInnerFn.apply(xz, conv1d_weight, conv1d_bias, x_proj_weight, delta_proj_weight,
  File "/home/user/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/autograd/function.py", line 539, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/home/user/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/cuda/amp/autocast_mode.py", line 113, in decorate_fwd
    return fwd(*args, **kwargs)
  File "/home/user/miniconda3/envs/textgen/lib/python3.10/site-packages/mamba_ssm/ops/selective_scan_interface.py", line 181, in forward
    conv1d_out = causal_conv1d_cuda.causal_conv1d_fwd(x, conv1d_weight, conv1d_bias, True)
TypeError: causal_conv1d_fwd(): incompatible function arguments. The following argument types are supported:
    1. (arg0: torch.Tensor, arg1: torch.Tensor, arg2: Optional[torch.Tensor], arg3: Optional[torch.Tensor], arg4: bool) -> torch.Tensor
Invoked with: tensor([[[-0.4806, 1.2685, 0.3929, ...]]], device='cuda:0', requires_grad=True), tensor([[-0.0555, 0.4169, 0.2594, -0.4943], ...], device='cuda:0', requires_grad=True), Parameter containing: tensor([-3.1444e-01, 4.3207e-02, ...], device='cuda:0', requires_grad=True), True
[full tensor dumps elided; note that four arguments were passed where the binding expects five]
>>>
oops, I did it out of order. Never mind: it still produced the same error after applying the same process to mamba's setup.py.
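(For anyone hitting the same TypeError: it is an API mismatch between the installed causal-conv1d and the Mamba release calling into it, not a CUDA problem. You can inspect which signature your compiled binding actually exposes, since pybind11 records it in the docstring:)
import causal_conv1d_cuda  # the compiled extension Mamba calls into
# pybind11 embeds the accepted signature(s) in the function docstring
print(causal_conv1d_cuda.causal_conv1d_fwd.__doc__)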
Fyi for us newbs
CCC stands for "CUDA Compute Capability," a version number that represents the features supported by a piece of CUDA (Compute Unified Device Architecture) hardware, typically a GPU. CUDA is a parallel computing platform and application programming interface (API) created by Nvidia that lets software developers use a CUDA-enabled GPU for general-purpose processing (an approach known as GPGPU, General-Purpose computing on Graphics Processing Units).
The Compute Capability is a version number indicating the features supported by the GPU. Different versions of CUDA GPUs support different features and therefore have different Compute Capabilities. For example, the Quadro P5200 and GeForce GTX 1070 GPUs mentioned have a Compute Capability of 6.1. This version number is important for developers because they need to compile their programs for a specific Compute Capability to ensure compatibility and optimal performance on the target GPU.
When you modify a setup.py file of a Python package to include specific Compute Capability flags, you are instructing the compiler to generate code optimized for GPUs with that particular Compute Capability. This is often necessary when working with older GPUs or when the pre-compiled binaries of a library do not support the specific Compute Capability of your GPU.
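As a concrete check, you can list which architectures your PyTorch build itself was compiled for and compare against your GPU:
import torch
# e.g. ['sm_50', 'sm_60', 'sm_70', ...]; if your card's sm_XX is missing
# from an extension's build, you get the "no kernel image" error
print(torch.cuda.get_arch_list())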
btw, I had to do something similar to get ctransformers to work
oops, sorry, but I forgot a crucial thing: Mamba states that it requires causal_conv1d version <= 1.0.2. So you need to do a git checkout v1.0.2 before you do the pip install. From where you are now, I'd say it would be:
$ cd causal-conv1d
$ git checkout v1.0.2
# you've already edited the setup.py file I assume
$ pip uninstall causal-conv1d
$ CAUSAL_CONV1D_FORCE_BUILD=TRUE pip install .
At this point, it may work ;) Since Mamba dynamically loads the causal-conv1d python module, no re-compilation of Mamba should be necessary, but I am not positive of that.
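(One way to confirm which build actually gets loaded, in case multiple installs are lying around:)
import causal_conv1d, causal_conv1d_cuda
print(causal_conv1d.__file__)       # should point into the rebuilt install
print(causal_conv1d_cuda.__file__)  # the compiled extension that Mamba loads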
(i edited the original instruction to reflect this just now)
Sorry I'm traveling this week but will have time to look into this next week.
Processing /home/user/mamba/causal-conv1d
Installing build dependencies ... done
Getting requirements to build wheel ... error
error: subprocess-exited-with-error
× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> [17 lines of output]
Traceback (most recent call last):
  File "/home/user/lit-gpt/env/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
    main()
  File "/home/user/lit-gpt/env/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
    json_out['return_val'] = hook(**hook_input['kwargs'])
  File "/home/user/lit-gpt/env/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 118, in get_requires_for_build_wheel
    return hook(config_settings)
  File "/tmp/pip-build-env-w4x0ekut/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 325, in get_requires_for_build_wheel
    return self._get_build_requires(config_settings, requirements=['wheel'])
  File "/tmp/pip-build-env-w4x0ekut/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 295, in _get_build_requires
    self.run_setup()
  File "/tmp/pip-build-env-w4x0ekut/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 480, in run_setup
    super(_BuildMetaLegacyBackend, self).run_setup(setup_script=setup_script)
  File "/tmp/pip-build-env-w4x0ekut/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 311, in run_setup
    exec(code, locals())
  File "<string>", line 9, in <module>
ModuleNotFoundError: No module named 'packaging'
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error
× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.
note: This error originates from a subprocess, and is likely not a problem with pip.
despite installing python3-packaging and pip install packaging (and I can confirm I can import packaging; pip builds in an isolated environment, so packages installed in the venv aren't visible to setup.py there, which is why pip install . --no-build-isolation is another way around this)
nm, got past it with:
pip install wheel
python setup.py
yay, that did it; back in the game =D
Oh, God, I solved it by following the instructions above! Love from a P40 (CCC 6.1)!!!
this solution no longer works for the latest mamba build
(mamba-venv) [root@pve-m7330 mamba]# python
Python 3.10.9 (main, Mar 8 2023, 10:47:38) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> from mamba_ssm import Mamba
/home/user/mamba/mamba_ssm/ops/selective_scan_interface.py:164: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
[the same FutureWarning repeats for the custom_fwd/custom_bwd definitions in selective_scan_interface.py, ops/triton/layer_norm.py, distributed/tensor_parallel.py, and ops/triton/ssd_combined.py]
>>> batch, length, dim = 2, 64, 16
>>> x = torch.randn(batch, length, dim).to("cuda")
>>> model = Mamba(
...     # This module uses roughly 3 * expand * d_model^2 parameters
...     d_model=dim,  # Model dimension d_model
...     d_state=16,   # SSM state expansion factor
...     d_conv=4,     # Local convolution width
...     expand=2,     # Block expansion factor
... ).to("cuda")
>>> y = model(x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/user/mamba-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/mamba-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/mamba/mamba_ssm/modules/mamba_simple.py", line 146, in forward
    out = mamba_inner_fn(
  File "/home/user/mamba/mamba_ssm/ops/selective_scan_interface.py", line 317, in mamba_inner_fn
    return MambaInnerFn.apply(xz, conv1d_weight, conv1d_bias, x_proj_weight, delta_proj_weight,
  File "/home/user/mamba-venv/lib/python3.10/site-packages/torch/autograd/function.py", line 574, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/home/user/mamba-venv/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 455, in decorate_fwd
    return fwd(*args, **kwargs)
  File "/home/user/mamba/mamba_ssm/ops/selective_scan_interface.py", line 187, in forward
    conv1d_out = causal_conv1d_cuda.causal_conv1d_fwd(
TypeError: causal_conv1d_fwd(): incompatible function arguments. The following argument types are supported:
    1. (arg0: torch.Tensor, arg1: torch.Tensor, arg2: Optional[torch.Tensor], arg3: bool) -> torch.Tensor
Invoked with: tensor([...], device='cuda:0', requires_grad=True), tensor([...], device='cuda:0', requires_grad=True), Parameter containing: tensor([...], device='cuda:0', requires_grad=True), None, None, None, True
[full tensor dumps elided; seven arguments were passed where the old binding expects four]
>>> assert y.shape == x.shape
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'y' is not defined
I'm going to try a newer causal-conv1d than 1.0.2, but these instructions used to work... why not include compute 6.0 in the source rather than have us patch it in? Compute 5.3 is in there.
disregard: it works when I compile causal-conv1d v1.4.0 with compute_60 patched into setup.py
although this makes a good argument for including it by default, no?
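For reference, a minimal smoke test that should be version-agnostic, since it only passes the required arguments (a sketch, not tested against every release):
import torch
from causal_conv1d import causal_conv1d_fn

x = torch.randn(2, 8, 32, device='cuda')   # (batch, dim, seqlen)
weight = torch.randn(8, 4, device='cuda')  # (dim, width)
out = causal_conv1d_fn(x, weight)          # bias and activation default to None
assert out.shape == x.shape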
unfortunately, when trying to import torch I now get:
A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.1.1 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.
I would think that if this patch were better integrated, this wouldn't happen.
nevermind, I fixed that with pip install numpy==1.*
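For completeness, a one-line check that the pin took effect:
import numpy
print(numpy.__version__)  # should now report a 1.x release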