Any plan to support CPU-only mode?
Really impressive results 👏
Any plan to support CPU-only mode? That way it could be used on commodity laptops such as a MacBook Pro.
Or on an M1/M2 Mac with Metal support: https://github.com/ggerganov/llama.cpp/pull/1642
Is there a reason why not? Is it not possible to speed up CPU inference in the same ways?
+1 for Metal support
This should dramatically reduce cost if CPU acceleration is enabled, so why not? However, a CPU server usually has more memory than a GPU, so the paging may not be the main accelerating point, but vLLM may have other approaches to boost performance. Hope to see that soon.
I have a CPU use case that may be of interest to many: having used Spark in the past, I have access to VM environments with a ton of CPU and RAM (unfortunately no GPUs, and I can't control that). I would love to reallocate a bunch of those resources to see how vLLM would run on these preexisting (already bought and paid for) resources.
I just opened PR #2244 - feel free to test it and give feedback.
I only tested on Apple Silicon (M3), but the original code from #1028 was designed for Intel, so recent Intel CPUs with bfloat16 instructions should also work. Let me know if I broke anything.
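If you want to check whether your x86 CPU exposes the bfloat16 (AVX-512 BF16) instructions mentioned above, here is a minimal sketch. It is Linux-only, since it reads `/proc/cpuinfo`; `avx512_bf16` is the flag name the kernel reports.

```python
# Linux-only sketch: look for the AVX-512 BF16 flag in /proc/cpuinfo.
def cpu_has_avx512_bf16() -> bool:
    try:
        with open("/proc/cpuinfo") as f:
            return "avx512_bf16" in f.read()
    except FileNotFoundError:
        # /proc does not exist on macOS or Windows.
        return False

if __name__ == "__main__":
    print("AVX-512 BF16 supported:", cpu_has_avx512_bf16())
```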
Hello, since I don't own any NVIDIA GPUs and can only deploy to paid, hosted solutions, I'd need CPU support for local development so I don't have to pay while developing. I don't think I'm the only one.
Also wondering...
Very interested in this!
Very interested too!
Please
Interested too!
Also interested!
Please do not reply with only a basic "I'm interested" or "please" without contributing any new information. To show support, react to the main post with a thumbs up 👍 emoji.

This sort of proposal needs a volunteer to come along and implement it. I wrote a CPU implementation that unfortunately only works on the latest ARM processors (MacBook with Apple M3), due to their support of BF16. See this excellent Stack Overflow post on which float types the various CPU architectures support.

THAT SAID, the CPU version in my branch may already work for float32 models (4 bytes per float), and I would encourage you to give it a try. Most models are distributed as fp16, so you would need to convert them to fp32 first (a conversion sketch is below).

If we want to expand the capability, we will need an easy way to convert models to full 32-bit float, or to use non-bfloat fp16. However, due to the lack of resolution in conventional fp16, that would require finetuning a multiplier, the same way fp8-type models would be implemented, which vLLM does not support. The other option is to go through GPTQ, which I have not personally worked with in vLLM; GPTQ would also require a CPU-native implementation.

Hope this summarizes the current position and the challenges for support. Again, if you are interested in this area, simply saying "interested" does not help the conversation.
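For the fp16 → fp32 conversion mentioned above, here is a minimal sketch using Hugging Face `transformers`; the model id and output path are just illustrative placeholders.

```python
# Sketch: upcast an fp16 checkpoint to fp32 and save it locally.
# Assumes `transformers` and `torch` are installed; the model id is illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/opt-125m"   # replace with the model you want to run
out_dir = "./opt-125m-fp32"

# torch_dtype=torch.float32 casts the (typically fp16) weights to fp32 at load time.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)
model.save_pretrained(out_dir)

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.save_pretrained(out_dir)
```

Keep in mind that the fp32 checkpoint roughly doubles disk and memory usage compared to fp16, so make sure the machine has enough RAM.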
x86 CPU support was added in https://github.com/vllm-project/vllm/pull/3634
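Once a CPU-enabled build of vLLM is installed (see that PR for build details), the Python API is the same as on GPU. A minimal sketch, with an illustrative model name:

```python
# Sketch: standard vLLM Python API, assuming a CPU-enabled build of vLLM is installed.
from vllm import LLM, SamplingParams

# dtype="float32" avoids bf16 requirements on CPUs without AVX-512 BF16,
# at the cost of roughly 2x memory; use "bfloat16" if your CPU supports it.
llm = LLM(model="facebook/opt-125m", dtype="float32")

params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["vLLM running on a CPU-only machine can"], params)
print(outputs[0].outputs[0].text)
```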
Since there are other issues tracking specific architectures, I will close this one as complete now that a CPU-only mode exists.