Any plan to support CPU-only mode?
Really impressive results 👏
Any plan to support CPU-only mode? That way it could be used on commodity laptops such as a MacBook Pro.
Or on an M1/M2 Mac with Metal support: https://github.com/ggerganov/llama.cpp/pull/1642
Is there a reason why not? Is it not possible to speed up CPU inference in the same ways?
+1 for Metal support
This should dramatically reduce cost if CPU acceleration is enabled, so why not? However, a CPU server usually has more memory than a GPU, so the paging may not be the main accelerating point, but vLLM may have other approaches to boost performance. Hope to see that soon.
I have a CPU use case that may be of interest to many: having used Spark in the past, I have access to VM environments with a ton of CPU and RAM (unfortunately no GPUs, and I can't control that). I would love to reallocate a bunch of those resources to see how vLLM would run on these preexisting (already bought and paid for) resources.
I just opened PR #2244 - feel free to test it and give feedback.
I only tested on Apple Silicon (M3), but the original code from #1028 was designed for Intel, so recent Intel CPUs with bfloat16 instructions should also work. Let me know if I broke anything.
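If you want to check whether your x86 CPU exposes the bfloat16 (AVX-512 BF16) instructions mentioned above, here is a minimal sketch. It is Linux-only, since it reads `/proc/cpuinfo`; `avx512_bf16` is the flag name the kernel reports.

```python
# Linux-only sketch: look for the AVX-512 BF16 flag in /proc/cpuinfo.
def cpu_has_avx512_bf16() -> bool:
    try:
        with open("/proc/cpuinfo") as f:
            return "avx512_bf16" in f.read()
    except FileNotFoundError:
        # /proc does not exist on macOS or Windows.
        return False

if __name__ == "__main__":
    print("AVX-512 BF16 supported:", cpu_has_avx512_bf16())
```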
Hello, since I don't own any NVIDIA GPUs and can only deploy to paid, hosted solutions, I'd need CPU support for local development so I don't have to pay while developing. I don't think I'm the only one.
Also wondering...
Very interested in this!
Very interested too!
Please
Interested too!
Also interested!
Please do not reply with only a basic "I'm interested" or "please" without contributing any new information. To show support, react to the main post with a thumbs up 👍 emoji.

This sort of proposal needs a volunteer to come along and implement it. I wrote a CPU implementation that unfortunately only works on the latest ARM processors (MacBook with Apple M3), due to their support of BF16. See this excellent Stack Overflow post on which float types the various CPU architectures support.

THAT SAID, the CPU version in my branch may already work for float32 models (4 bytes per float), and I would encourage you to give it a try. Most models are distributed as fp16, so you would need to convert them to fp32 first (a conversion sketch is below).

If we want to expand the capability, we will need an easy way to convert models to full 32-bit float, or to use non-bfloat fp16. However, due to the lack of resolution in conventional fp16, that would require finetuning a multiplier, the same way fp8-type models would be implemented, which vLLM does not support. The other option is to go through GPTQ, which I have not personally worked with in vLLM; GPTQ would also require a CPU-native implementation.

Hope this summarizes the current position and the challenges for support. Again, if you are interested in this area, simply saying "interested" does not help the conversation.
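For the fp16 → fp32 conversion mentioned above, here is a minimal sketch using Hugging Face `transformers`; the model id and output path are just illustrative placeholders.

```python
# Sketch: upcast an fp16 checkpoint to fp32 and save it locally.
# Assumes `transformers` and `torch` are installed; the model id is illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/opt-125m"   # replace with the model you want to run
out_dir = "./opt-125m-fp32"

# torch_dtype=torch.float32 casts the (typically fp16) weights to fp32 at load time.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)
model.save_pretrained(out_dir)

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.save_pretrained(out_dir)
```

Keep in mind that the fp32 checkpoint roughly doubles disk and memory usage compared to fp16, so make sure the machine has enough RAM.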
x86 CPU support was added in https://github.com/vllm-project/vllm/pull/3634
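Once a CPU-enabled build of vLLM is installed (see that PR for build details), the Python API is the same as on GPU. A minimal sketch, with an illustrative model name:

```python
# Sketch: standard vLLM Python API, assuming a CPU-enabled build of vLLM is installed.
from vllm import LLM, SamplingParams

# dtype="float32" avoids bf16 requirements on CPUs without AVX-512 BF16,
# at the cost of roughly 2x memory; use "bfloat16" if your CPU supports it.
llm = LLM(model="facebook/opt-125m", dtype="float32")

params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["vLLM running on a CPU-only machine can"], params)
print(outputs[0].outputs[0].text)
```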
Since there are other issues tracking specific architectures, I will close this one as complete now that a CPU-only mode exists.