text-generation-webui
[Request] Support for llama.cpp
I'd love to see support for llama.cpp. I am currently running the 13B model (4-bit) on an M2 MacBook Air with 24 GB of RAM at about 270 ms per token, which, all things considered, is pretty good.
llama.cpp is an interesting development: it supports Mac M1/M2 and x86 AVX2 instructions (i.e., it's pretty quick for a CPU implementation). I'm able to load the 65B 4-bit model and get around 850 ms per token. That said, it looks like a fair bit of coordination and glue is needed to get these talking. llama.cpp isn't really set up as a library, nor does it offer an API. Not that those are any sort of heavy lift, but they might be the kind of request that needs to be implemented on their side first before it can be leveraged by others.
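For what it's worth, the glue could stay fairly thin if the webui just shells out to the compiled binary. Below is a minimal sketch of that approach, assuming the stock `main` executable and its current CLI flags (`-m`, `-p`, `-n`, `-t`); the binary location, flag names, and output format are assumptions and may change upstream.

```python
# Minimal sketch: drive the compiled llama.cpp binary as a subprocess.
# Binary path, model path, and flags (-m, -p, -n, -t) are assumptions
# based on the current llama.cpp CLI and may change.
import subprocess

def generate(prompt: str, model_path: str, n_tokens: int = 128, threads: int = 8) -> str:
    cmd = [
        "./main",             # compiled llama.cpp executable (assumed location)
        "-m", model_path,     # path to the 4-bit ggml model file
        "-p", prompt,         # prompt text
        "-n", str(n_tokens),  # number of tokens to generate
        "-t", str(threads),   # CPU threads
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    # llama.cpp echoes the prompt before the completion, so strip it if present
    out = result.stdout
    return out[len(prompt):] if out.startswith(prompt) else out

if __name__ == "__main__":
    print(generate("The meaning of life is", "./models/7B/ggml-model-q4_0.bin"))
```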
According to my tests, llama.cpp with 4-bit quantization is much faster than GPU+RAM offloading, at least for the 7B model. However, since it's a C++ program that requires compiling, it might be hard to integrate. Possible approaches include building a dynamic library for every OS, or patching the code so it acts as a "backend" for the webui (which would also require compiling the code beforehand; see the ctypes sketch below), but either way, that would be a lot of work for ooba...
RAM usage (Ryzen 3700 system):
- 7B: 4529.34 MB
- 30B: 20951.50 MB
- 65B: 41477.73 MB
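To illustrate the "dynamic library" idea: if llama.cpp were compiled as a shared library for each OS, the webui could load it with ctypes roughly like this. The library name and every function signature here (`llama_init`, `llama_generate`) are hypothetical, since llama.cpp doesn't expose a C API yet; this only shows the shape of the binding a backend would need.

```python
# Hypothetical sketch of the "dynamic library" approach: compile llama.cpp as a
# shared object and call into it with ctypes. None of these symbols exist in
# llama.cpp today; they only illustrate what a backend binding could look like.
import ctypes

lib = ctypes.CDLL("./libllama.so")  # hypothetical shared library, one build per OS

lib.llama_init.argtypes = [ctypes.c_char_p]       # model path
lib.llama_init.restype = ctypes.c_void_p          # opaque context handle
lib.llama_generate.argtypes = [ctypes.c_void_p, ctypes.c_char_p, ctypes.c_int]
lib.llama_generate.restype = ctypes.c_char_p      # generated text

ctx = lib.llama_init(b"./models/7B/ggml-model-q4_0.bin")
out = lib.llama_generate(ctx, b"The meaning of life is", 128)
print(out.decode("utf-8"))
```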
@Silver267 there's a library version someone is working on: https://github.com/j-f1/forked-llama.cpp/tree/swift. It was posted about here: https://github.com/ggerganov/llama.cpp/issues/23#issuecomment-1465017679
Looks like there's a draft PR for this: https://github.com/oobabooga/text-generation-webui/pull/447
https://github.com/PotatoSpudowski/fastLLaMa might be relevant
Discussion moved to https://github.com/oobabooga/text-generation-webui/issues/575