
CPU inference?

Open nFunctor opened this issue 1 year ago • 4 comments

Hi, thanks for your work, the ssm architecture is very interesting!

I am using some tiny variants of Mamba blocks in my work and would appreciate an option for CPU inference. If I understand correctly, things break on device="cpu" because of causal_conv1d; perhaps it is not too much to ask to have it fall back to F.conv1d, for example? An in-house CPU inference path would be much appreciated!
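For illustration, a fallback along those lines might look like the sketch below. This is only an assumption about the interface: the function name `causal_conv1d_cpu`, the `(dim, width)` depthwise weight layout, and the optional SiLU activation are guesses at what the CUDA kernel expects, not the library's actual signature.

```python
import torch
import torch.nn.functional as F

def causal_conv1d_cpu(x, weight, bias=None, activation=None):
    """Hypothetical CPU fallback for a depthwise causal conv1d.

    x:      (batch, dim, seqlen)
    weight: (dim, width) -- one filter per channel (assumed layout)
    """
    dim, width = weight.shape
    # Left-pad by width-1 so each output position sees only current
    # and past inputs, then trim the extra positions on the right.
    out = F.conv1d(x, weight.unsqueeze(1), bias=bias,
                   padding=width - 1, groups=dim)[..., : x.shape[-1]]
    if activation == "silu":
        out = F.silu(out)
    return out
```

The slice after `F.conv1d` is what makes the convolution causal: with symmetric padding of `width - 1`, keeping only the first `seqlen` outputs discards the positions that would depend on future inputs.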

nFunctor avatar Jan 29 '24 09:01 nFunctor

Do you want to send a PR for the conv1d? The selective_scan operation is also implemented in CUDA, but there's a reference implementation in PyTorch (probably quite slow).
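A sequential reference scan might look roughly like this in plain PyTorch. This is a sketch only: the argument names and shapes below (input-dependent `B`/`C` of shape `(batch, dstate, seqlen)`, an optional `D` skip term) are assumptions modeled on the Mamba paper's recurrence, not the repo's exact selective_scan API.

```python
import torch

def selective_scan_ref(u, delta, A, B, C, D=None):
    """Sequential selective scan: h_t = exp(delta_t A) h_{t-1} + delta_t B_t u_t,
    y_t = C_t h_t (+ D u_t).

    u, delta: (batch, dim, seqlen)
    A:        (dim, dstate)
    B, C:     (batch, dstate, seqlen)
    D:        (dim,) optional
    """
    batch, dim, seqlen = u.shape
    dstate = A.shape[1]
    # Discretize the continuous parameters: (batch, dim, seqlen, dstate)
    deltaA = torch.exp(delta.unsqueeze(-1) * A[None, :, None, :])
    deltaB_u = delta.unsqueeze(-1) * B.transpose(1, 2).unsqueeze(1) * u.unsqueeze(-1)
    h = u.new_zeros(batch, dim, dstate)
    ys = []
    for t in range(seqlen):  # the slow sequential loop
        h = deltaA[:, :, t] * h + deltaB_u[:, :, t]
        ys.append((h * C[:, :, t].unsqueeze(1)).sum(-1))  # (batch, dim)
    y = torch.stack(ys, dim=-1)  # (batch, dim, seqlen)
    if D is not None:
        y = y + u * D.unsqueeze(-1)
    return y
```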

tridao avatar Jan 29 '24 10:01 tridao

I guess it makes sense to do a PR for both conv1d and selective_scan in the forward pass; there seems to be a faster-than-sequential implementation of the selective scan here, and perhaps that is the way to go, though it would add a dependency on pscan. I'll see if I can bring myself to do that.
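The parallel-scan idea works because the recurrence h_t = a_t * h_{t-1} + b_t is associative under the combine (a1, b1) ∘ (a2, b2) = (a1*a2, a2*b1 + b2), so it can be evaluated in O(log L) sequential steps. A minimal Hillis–Steele-style sketch (my own illustration, not pscan's actual API) over the last axis:

```python
import torch

def linear_scan_parallel(a, b):
    """Evaluate h_t = a_t * h_{t-1} + b_t for all t (h_0 = 0) in parallel.

    a, b: (..., seqlen). Each element (a_t, b_t) is combined with the
    element `step` positions earlier via
    (a_prev, b_prev) o (a_cur, b_cur) = (a_prev*a_cur, a_cur*b_prev + b_cur),
    doubling `step` each round (Hillis-Steele inclusive scan).
    """
    L = a.shape[-1]
    step = 1
    while step < L:
        # Shift in the identity element (1, 0) for positions < step.
        a_prev = torch.ones_like(a)
        b_prev = torch.zeros_like(b)
        a_prev[..., step:] = a[..., :-step]
        b_prev[..., step:] = b[..., :-step]
        b = a * b_prev + b  # combine before overwriting a
        a = a * a_prev
        step *= 2
    return b  # b_t now holds h_t
```

This trades the O(L) sequential loop for O(log L) rounds of elementwise tensor ops, which is the same trick a pscan dependency would provide.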

nFunctor avatar Jan 29 '24 10:01 nFunctor

Hi, I have a fork here with a CPU-only version using a reference scan loop. It runs decently enough with small models; perhaps a compiled loop would speed it up a bit:

https://github.com/proger/mamba-cpu/commits/main/

proger avatar Jan 30 '24 07:01 proger

@proger I've been trying to use your fork for inference on a CPU-only device, but I'm not sure it's configured correctly. When I build and run it out of the box, installing with pip with MAMBA_SKIP_CUDA_BUILD=TRUE and MAMBA_FORCE_BUILD=TRUE, the selective scan still tries to call the CUDA version of causal_conv1d. Is your fork meant to be a drop-in replacement, or do I need to modify my existing code to call the CPU-only version of Mamba?

csimo005 avatar Feb 23 '24 19:02 csimo005

It's now available in llama.cpp (which has supported Mamba GGUF files since February).

ekianjo avatar Mar 10 '24 13:03 ekianjo