CPU inference?
Hi, thanks for your work; the SSM architecture is very interesting!
I am using some tiny variants of Mamba blocks in my work and would appreciate an option for CPU inference. If I understand correctly, things break on device="cpu" because of causal_conv1d; perhaps it is not too much to ask to have it fall back to F.conv1d, for example? An in-house implementation for CPU inference would be much appreciated!
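Something along these lines, perhaps (just a sketch, assuming the depthwise weight layout (dim, width) that causal_conv1d takes):

```python
import torch.nn.functional as F

def causal_conv1d_cpu(x, weight, bias=None, activation=None):
    # Hypothetical fallback: x is (batch, dim, seqlen) and weight is the
    # depthwise kernel of shape (dim, width).
    seqlen = x.shape[-1]
    dim, width = weight.shape
    # Pad on both sides, run a depthwise conv, then keep only the first
    # seqlen outputs so each position sees only current and past inputs.
    out = F.conv1d(x, weight.unsqueeze(1), bias, padding=width - 1, groups=dim)
    out = out[..., :seqlen]
    return F.silu(out) if activation in ("silu", "swish") else out
```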
Do you want to send a PR for the conv1d? The selective_scan operation is also implemented in CUDA, but there's a reference implementation in PyTorch (probably quite slow).
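Roughly, the sequential scan looks like this (a minimal sketch; the shapes and argument names are assumptions loosely following the paper, not the exact signature of the reference implementation):

```python
import torch

def selective_scan_seq(u, delta, A, B, C, D=None):
    # Assumed shapes: u, delta (b, d, l); A (d, n); B, C (b, n, l); D (d,).
    b, d, l = u.shape
    n = A.shape[-1]
    # Discretize: dA = exp(delta * A), dBu = delta * B * u, per timestep.
    dA = torch.exp(delta.unsqueeze(-1) * A)                               # (b, d, l, n)
    dBu = delta.unsqueeze(-1) * B.transpose(1, 2).unsqueeze(1) * u.unsqueeze(-1)
    x = u.new_zeros(b, d, n)
    ys = []
    for t in range(l):
        x = dA[:, :, t] * x + dBu[:, :, t]                    # recurrent state update
        ys.append(torch.einsum("bdn,bn->bd", x, C[:, :, t]))  # readout
    y = torch.stack(ys, dim=-1)                               # (b, d, l)
    return y if D is None else y + u * D.unsqueeze(-1)        # skip connection
```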
I guess it makes sense to do a PR for both conv1d and selective_scan in the forward; there seems to be a faster-than-sequential implementation of the selective scan in here, and perhaps that is the way to go, though it would add a dependency on pscan. I'll see if I can bring myself to do that.
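For illustration, the core trick behind such a parallel scan, written as a toy Hillis-Steele pass over the recurrence x_t = a_t * x_{t-1} + b_t (my own sketch, not pscan's actual API):

```python
import torch

def pscan_sketch(a, b):
    # Log-depth scan for x_t = a_t * x_{t-1} + b_t with x_{-1} = 0,
    # scanning over the last axis. O(l log l) work, but only O(log l)
    # sequential steps versus l steps for the plain loop.
    a, b = a.clone(), b.clone()
    l = a.shape[-1]
    d = 1
    while d < l:
        # Combine each position with the one d steps earlier, using the
        # associative operator (a2, b2) o (a1, b1) = (a1*a2, a2*b1 + b2).
        nb = a[..., d:] * b[..., :-d] + b[..., d:]
        na = a[..., d:] * a[..., :-d]
        a = torch.cat([a[..., :d], na], dim=-1)
        b = torch.cat([b[..., :d], nb], dim=-1)
        d *= 2
    return b  # b[..., t] now holds x_t
```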
Hi, I have a fork here with a CPU-only version using a reference scan loop. It runs decently enough with small models; perhaps a compiled loop would speed it up a bit:
https://github.com/proger/mamba-cpu/commits/main/
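For example, something like this might help (purely a sketch; scan_step and its arguments are hypothetical names for a factored-out loop body, not code from the fork):

```python
import torch

@torch.compile  # requires PyTorch 2.x
def scan_step(x, dA_t, dBu_t, C_t):
    # One step of the recurrence; compiling lets the elementwise ops
    # inside the Python-level scan loop be fused into fewer kernels.
    x = dA_t * x + dBu_t
    y = torch.einsum("bdn,bn->bd", x, C_t)
    return x, y
```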
@proger I've been trying to make use of your fork to do inference on a CPU-only device, but I'm not sure if it's configured correctly. When I build and run it out of the box, using pip install with MAMBA_SKIP_CUDA_BUILD=TRUE and MAMBA_FORCE_BUILD=TRUE, selective scan is still trying to call the CUDA version of causal_conv1d. Is your fork meant to be a drop-in replacement, or do I need to modify my existing code to call the CPU-only version of Mamba?
It's now available in llama.cpp (which has supported Mamba GGUF files since February).