Llama Guard, routing, and vLLM
System Info
CUDA 12.6, torch 2.5.1, NVIDIA GPU
Information
- [X] The official example scripts
- [ ] My own modified scripts
🐛 Describe the bug
I'm trying to get the safety API working with remote::vllm. There isn't a lot of good documentation on how to do this, but I have found a couple of bug reports in the apps repo where people report: "ValueError: Llama-Guard-3-1B not registered. Make sure there is an Inference provider serving this model"
So I have spun up two instances of vLLM on localhost, on ports 8000 and 8001. In the following issue, two inference providers were configured, but with ollama: https://github.com/meta-llama/llama-stack-apps/issues/89#issuecomment-2415187248
I have tried to replicate that with the following run.yaml configuration:
```yaml
providers:
  inference:
    - provider_id: inf::vllm
      provider_type: remote::vllm
      config:
        url: http://127.0.0.1:8000/v1
        api_token: fake
        model: Llama-3.2-1B
    - provider_id: safety::vllm
      provider_type: remote::vllm
      config:
        url: http://127.0.0.1:8001/v1
        api_token: fake
        model: Llama-Guard-3-1B
```
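As a quick sanity check, the two vLLM endpoints above can be probed through their OpenAI-compatible /v1/models route to confirm they are up and to see which model IDs they actually serve. A rough sketch (the URLs and the "fake" token are the ones from the config above):

```python
import requests

# Illustrative sanity check: query the OpenAI-compatible /v1/models endpoint
# on both vLLM servers to confirm they respond and to list the model IDs
# they serve (these must match what the stack expects to route to).
for base_url in ("http://127.0.0.1:8000/v1", "http://127.0.0.1:8001/v1"):
    resp = requests.get(f"{base_url}/models", headers={"Authorization": "Bearer fake"})
    resp.raise_for_status()
    print(base_url, "->", [m["id"] for m in resp.json()["data"]])
```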
Then, from a terminal on the same system, I run:
```
python -m llama_stack.apis.safety.client localhost 5000
```
Both the server and the client produce a traceback. The client-side traceback is just a consequence of the server-side one; it only reports a 500 Internal Server Error. The Guard server shows only a single connection message, and the Python client above never got far enough to actually connect to it.
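For what it's worth, the client call boils down to a POST against the stack's /safety/run_shield endpoint (the path shows up in the server log below). Here is a rough sketch of hitting it directly with requests; note that the body fields shield_type and messages are guesses inferred from the server traceback, not a verified schema:

```python
import requests

# Rough sketch of the failing call, made directly against the llama-stack server.
# Assumptions: the endpoint path comes from the server log; the body fields
# ("shield_type", "messages") and the "llama_guard" identifier are guesses based
# on the traceback and may not match the actual request schema.
resp = requests.post(
    "http://localhost:5000/safety/run_shield",
    json={
        "shield_type": "llama_guard",
        "messages": [{"role": "user", "content": "Tell me how to pick a lock."}],
    },
)
print(resp.status_code, resp.text)  # currently: 500 Internal Server Error
```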
- Is this the right recipe for connecting to 2 instances, one for inference and the other for safety?
- Does this need a routing table?
Error logs
From the Guard server:
```
INFO:     192.168.1.8:33614 - "GET /v1/models HTTP/1.1" 200 OK
```
Nothing else but metrics messages.
This is from the container running llama-stack:
```
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/llama_stack/distribution/server/server.py", line 240, in endpoint
    return await maybe_await(value)
  File "/usr/local/lib/python3.10/site-packages/llama_stack/distribution/server/server.py", line 200, in maybe_await
    return await value
  File "/usr/local/lib/python3.10/site-packages/llama_stack/distribution/routers/routers.py", line 161, in run_shield
    return await self.routing_table.get_provider_impl(shield_type).run_shield(
  File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/impls/meta_reference/safety/safety.py", line 77, in run_shield
    res = await shield.run(messages)
  File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/impls/meta_reference/safety/llama_guard.py", line 187, in run
    async for chunk in await self.inference_api.chat_completion(
  File "/usr/local/lib/python3.10/site-packages/llama_stack/distribution/routers/routers.py", line 101, in chat_completion
    return (chunk async for chunk in await provider.chat_completion(**params))
TypeError: object async_generator can't be used in 'await' expression
INFO:     ::1:36496 - "POST /safety/run_shield HTTP/1.1" 500 Internal Server Error
```
Expected behavior
I was hoping it might work as shown here: https://github.com/meta-llama/llama-stack-apps
```
    async for chunk in await self.inference_api.chat_completion(
  File "/usr/local/lib/python3.10/site-packages/llama_stack/distribution/routers/routers.py", line 101, in chat_completion
    return (chunk async for chunk in await provider.chat_completion(**params))
```
This looks like a bug. Let me check quickly.
Found it
https://github.com/meta-llama/llama-stack/blob/adecb2a2d3bc5b5fb12280c54096706974e58201/llama_stack/providers/adapters/inference/vllm/vllm.py#L89
needs to be async def chat_completion
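For context, here is a minimal sketch of the failure mode, independent of the actual adapter code: the router does `async for chunk in await provider.chat_completion(**params)`, so the method has to be a coroutine that, once awaited, hands back an async iterator. A plain `def` that returns an async generator object directly can't be awaited, which is exactly the TypeError above. The function names below are hypothetical:

```python
import asyncio

def broken_chat_completion(**params):
    # A plain `def` that returns an async generator object directly.
    # Calling it gives back an async_generator, and awaiting that raises
    # "TypeError: object async_generator can't be used in 'await' expression".
    async def stream():
        yield "chunk"
    return stream()

async def fixed_chat_completion(**params):
    # An `async def` coroutine: awaiting the call returns the async iterator,
    # which matches `async for chunk in await provider.chat_completion(...)`.
    async def stream():
        yield "chunk"
    return stream()

async def main():
    try:
        await broken_chat_completion()
    except TypeError as e:
        print("broken:", e)
    async for chunk in await fixed_chat_completion():
        print("fixed:", chunk)

asyncio.run(main())
```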
Fixed it. See https://github.com/meta-llama/llama-stack/commit/bf4f97a2e190e41cddb96ad9cb1bf4fde5d673fb
@stevegrubb would you need an updated container, or are you able to rebuild a container locally with the updated commits (using LLAMA_STACK_DIR=... etc.)?
Anyhow, here are the instructions if you want to rebuild the container using `llama stack build`:
```
cd <your_repo_root>
git pull --rebase
LLAMA_STACK_DIR=<your_repo_root> llama stack build --config <appropriate_build_config> --image-type docker
```
Let me know if you need more help figuring out what the appropriate build config should be.
Unrelatedly, we will make sure we update our documentation to cover these aspects.
Thanks. I'll test this tomorrow.
OK, I was able to test that. It fixed the immediate problem. With that fixed, we run into this:
```
  File "/usr/local/lib/python3.10/site-packages/llama_stack/distribution/routers/routers.py", line 101, in
```
I fixed this by deleting the "request" variable on line 136 to match the other inference providers.
This can be closed out. The 2 issues discussed here have patches that are merged. Thanks.