
Usage of remote:vllm

Open TurboMa opened this issue 1 year ago • 2 comments

My understanding is that this works by deploying a model (e.g. Llama3.1-70B-Instruct) with `vllm serve Llama3.1-70B-Instruct ...` and then configuring the URL and model name in llama-stack to provide the LLM capability. However, there is not much information about this part in the docs. What do I need to do once I have finished deploying my model? Thanks
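
For concreteness, this is roughly the workflow I have in mind. The model name, host, and flags below are just examples, and I am not sure the llama-stack config keys are exactly right (they may differ by version), so please correct me:

```bash
# 1. Serve the model with vLLM on the GPU machine:
vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4 \
    --port 8000

# 2. Point llama-stack's remote vLLM inference provider at the
#    OpenAI-compatible endpoint in the run config (YAML), roughly:
#
#    providers:
#      inference:
#        - provider_id: vllm
#          provider_type: remote::vllm
#          config:
#            url: http://<gpu-host>:8000/v1    # <gpu-host> is a placeholder
```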

TurboMa avatar Nov 05 '24 07:11 TurboMa

I would agree with this in general. This might even be the solution for everyone wanting AMD support (or Gaudi or TPU), since it gives you maximum flexibility to serve the model with speculative decoding, multi-node deployment, quantization, LoRA adapters, and more. But it is very much undocumented, and several bugs need to be fixed before it is ready for general use. I am trying to pave the way so that this is supportable. With remote vLLM, you can offload the inference to a beefy machine and run the API server on your laptop to develop software.
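
As a rough sketch of that laptop-side setup (the host and config file names are placeholders, and the exact CLI flags may have changed between releases):

```bash
# From the laptop: confirm the remote vLLM server is reachable through
# its OpenAI-compatible API (<gpu-host> is a placeholder).
curl http://<gpu-host>:8000/v1/models

# Then start the llama-stack API server locally with a run config that
# selects the remote vLLM inference provider (file name is illustrative).
llama stack run ./remote-vllm-run.yaml --port 5000
```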

stevegrubb avatar Nov 05 '24 13:11 stevegrubb

https://github.com/meta-llama/llama-stack/pull/384 is a patch which just landed today and makes it work. We will add documentation and release updated packages soon.

ashwinb avatar Nov 07 '24 06:11 ashwinb