
EPIC: Run Self-hosted version on CPU

Open klink opened this issue 1 year ago • 6 comments

klink avatar Oct 17 '23 10:10 klink

I'd like to mention that if you handle Ollama or llama.cpp interop, you'll get models that run on a CPU for free. Ollama comes with a web API out of the box, and I think llama.cpp does as well. A lot of projects now allow users to target OpenAI-API-compatible endpoints.
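
For illustration, a minimal sketch of what "OpenAI-API-compatible" means in practice; it assumes a llama.cpp server (or any compatible backend) is already listening on localhost:8080, and the port and model name are placeholders rather than anything Refact-specific:

```python
# Minimal sketch: calling a locally hosted, OpenAI-compatible chat endpoint.
# Assumes a llama.cpp server (or similar) is listening on localhost:8080;
# the port and model name below are placeholders.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "llama-2-7b",  # some local servers ignore this field
        "messages": [
            {"role": "user", "content": "Explain what a GGUF file is."}
        ],
        "temperature": 0.2,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```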

comalice avatar Oct 19 '23 18:10 comalice

I would be interested in having an option to run on a CPU too, in addition to the GPU, just to maximise the benefit I get from the GPUs I have available. For example, running starcoder 7B on my GPU for code completion and llama 7B on the CPU for chat functionality in the VSCode plugin. Right now, if I want both functionalities, I have to resort to the smallest models to make sure they fit in my graphics card's VRAM.

octopusx avatar Nov 07 '23 13:11 octopusx

Hi @octopusx. We tested various models on CPU: a single code completion takes about 4-8 seconds on Apple M1 hardware, even for a 1.6b model or starcoder 1b. Maybe we'll train an even smaller model (0.3b?) to make it work with a smaller context. 7b on CPU will probably be good enough for chat, because the context prefill is so small, but not for code completion.

olegklimov avatar Nov 07 '23 13:11 olegklimov

@olegklimov for sure, I don't want to run anything on the CPU if I can avoid it, and especially not the code completion part. I was only thinking of moving the chat function to the CPU to free up my GPU for higher-quality code completion. Currently I run llama.cpp on the CPU for chat-based OpenAI-API integrations with a llama 2 7b chat model (https://huggingface.co/TheBloke/Llama-2-7B-GGUF/blob/main/llama-2-7b.Q4_K_M.gguf), and on a Ryzen 3000 series CPU I get close-to-instant chat responses. The key issue with this setup is that I cannot point my refact plugin at the llama.cpp endpoint for chat, and I cannot point my other chat integrations at the self-hosted refact, so I basically have to host two solutions at the same time...
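
For context, a minimal sketch of that kind of CPU-only chat setup, using the llama-cpp-python bindings with the quantized GGUF file linked above; the model path and thread count are just examples:

```python
# Sketch of CPU-only chat with a quantized Llama 2 7B GGUF model,
# using the llama-cpp-python bindings. Model path and thread count
# are examples; adjust for your machine.
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-2-7b.Q4_K_M.gguf",  # file downloaded from the link above
    n_ctx=2048,   # chat prompts are short, so a modest context is enough
    n_threads=8,  # e.g. the physical cores of a Ryzen 3000-series CPU
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarise what a Q4_K_M quantization is."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```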

octopusx avatar Nov 07 '23 13:11 octopusx

Ah I see, that makes total sense.

I think the best way to solve this is to add providers to the rust layer for the new plugins. We'll release the plugins "as is" this week, because we need to get them out and start getting feedback. Then, ~next week, we'll add the concept of providers to the rust layer. Hopefully you'll then be able to direct requests to your llama.cpp server.
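
To illustrate the idea only (this is not the actual Refact rust-layer design): a provider layer would route each request type to a different backend, e.g. code completion to the GPU-hosted server and chat to a local llama.cpp endpoint. A rough, hypothetical sketch with placeholder URLs and model names:

```python
# Hypothetical sketch of the "providers" idea: route each capability to a
# different OpenAI-compatible backend. URLs and model names are placeholders,
# not Refact configuration.
import requests

PROVIDERS = {
    "completion": "http://gpu-host:8008/v1",  # self-hosted server on the GPU
    "chat": "http://localhost:8080/v1",       # llama.cpp server on the CPU
}

def chat(messages, model="llama-2-7b-chat"):
    r = requests.post(f"{PROVIDERS['chat']}/chat/completions",
                      json={"model": model, "messages": messages}, timeout=120)
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]

def complete(prompt, model="starcoder-7b"):
    r = requests.post(f"{PROVIDERS['completion']}/completions",
                      json={"model": model, "prompt": prompt, "max_tokens": 64},
                      timeout=60)
    r.raise_for_status()
    return r.json()["choices"][0]["text"]
```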

olegklimov avatar Nov 07 '23 14:11 olegklimov

This is amazing, I will be on the lookout for the new releases and test this as soon as it's available.

octopusx avatar Nov 07 '23 15:11 octopusx