CPU inference?
Hi, thanks for your work; the SSM architecture is very interesting!
I am using some tiny variants of Mamba blocks in my work and would appreciate an option for CPU inference. If I understand correctly, things break on device="cpu" because of causal_conv1d; perhaps it is not too much to ask to have it fall back to F.conv1d, for example? An in-house implementation for CPU inference would be much appreciated!
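Something along these lines, perhaps (just a sketch, assuming the depthwise weight layout (dim, width) that causal_conv1d takes):

```python
import torch.nn.functional as F

def causal_conv1d_cpu(x, weight, bias=None, activation=None):
    # Hypothetical fallback: x is (batch, dim, seqlen) and weight is the
    # depthwise kernel of shape (dim, width).
    seqlen = x.shape[-1]
    dim, width = weight.shape
    # Pad on both sides, run a depthwise conv, then keep only the first
    # seqlen outputs so each position sees only current and past inputs.
    out = F.conv1d(x, weight.unsqueeze(1), bias, padding=width - 1, groups=dim)
    out = out[..., :seqlen]
    return F.silu(out) if activation in ("silu", "swish") else out
```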
Do you want to send a PR for the conv1d? The selective_scan operation is also implemented in CUDA, but there's a reference implementation in PyTorch (probably quite slow).
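Roughly, the sequential scan looks like this (a minimal sketch; the shapes and argument names are assumptions loosely following the paper, not the exact signature of the reference implementation):

```python
import torch

def selective_scan_seq(u, delta, A, B, C, D=None):
    # Assumed shapes: u, delta (b, d, l); A (d, n); B, C (b, n, l); D (d,).
    b, d, l = u.shape
    n = A.shape[-1]
    # Discretize: dA = exp(delta * A), dBu = delta * B * u, per timestep.
    dA = torch.exp(delta.unsqueeze(-1) * A)                               # (b, d, l, n)
    dBu = delta.unsqueeze(-1) * B.transpose(1, 2).unsqueeze(1) * u.unsqueeze(-1)
    x = u.new_zeros(b, d, n)
    ys = []
    for t in range(l):
        x = dA[:, :, t] * x + dBu[:, :, t]                    # recurrent state update
        ys.append(torch.einsum("bdn,bn->bd", x, C[:, :, t]))  # readout
    y = torch.stack(ys, dim=-1)                               # (b, d, l)
    return y if D is None else y + u * D.unsqueeze(-1)        # skip connection
```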
I guess it makes sense to do a PR for both conv1d and selective_scan in the forward; there seems to be a faster-than-sequential implementation of the selective scan in here, and perhaps that is the way to go, though it would add a dependency on pscan. I'll see if I can bring myself to do that.
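For illustration, the core trick behind such a parallel scan, written as a toy Hillis-Steele pass over the recurrence x_t = a_t * x_{t-1} + b_t (my own sketch, not pscan's actual API):

```python
import torch

def pscan_sketch(a, b):
    # Log-depth scan for x_t = a_t * x_{t-1} + b_t with x_{-1} = 0,
    # scanning over the last axis. O(l log l) work, but only O(log l)
    # sequential steps versus l steps for the plain loop.
    a, b = a.clone(), b.clone()
    l = a.shape[-1]
    d = 1
    while d < l:
        # Combine each position with the one d steps earlier, using the
        # associative operator (a2, b2) o (a1, b1) = (a1*a2, a2*b1 + b2).
        nb = a[..., d:] * b[..., :-d] + b[..., d:]
        na = a[..., d:] * a[..., :-d]
        a = torch.cat([a[..., :d], na], dim=-1)
        b = torch.cat([b[..., :d], nb], dim=-1)
        d *= 2
    return b  # b[..., t] now holds x_t
```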
Hi, I have a fork here with a CPU-only version using a reference scan loop. It runs decently enough with small models; perhaps a compiled loop would speed it up a bit:
https://github.com/proger/mamba-cpu/commits/main/
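For example, something like this might help (purely a sketch; scan_step and its arguments are hypothetical names for a factored-out loop body, not code from the fork):

```python
import torch

@torch.compile  # requires PyTorch 2.x
def scan_step(x, dA_t, dBu_t, C_t):
    # One step of the recurrence; compiling lets the elementwise ops
    # inside the Python-level scan loop be fused into fewer kernels.
    x = dA_t * x + dBu_t
    y = torch.einsum("bdn,bn->bd", x, C_t)
    return x, y
```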
@proger I've been trying to make use of your fork to do inference on a CPU-only device, but I'm not sure if it's configured correctly. When I build and run it out of the box, using pip install with MAMBA_SKIP_CUDA_BUILD=TRUE and MAMBA_FORCE_BUILD=TRUE, selective scan is still trying to call the CUDA version of causal_conv1d. Is your fork meant to be a drop-in replacement, or do I need to modify my existing code to call the CPU-only version of Mamba?
It's now available in llama.cpp (which has supported Mamba GGUF files since February).