Llama Guard, routing, and vLLM
System Info
CUDA 12.6, torch 2.5.1, NVIDIA GPU
Information
- [X] The official example scripts
- [ ] My own modified scripts
🐛 Describe the bug
I'm trying to get the safety API working with remote::vllm. There isn't a lot of good documentation on how to do this, but I have found a couple of bug reports in the apps repo where people report: "ValueError: Llama-Guard-3-1B not registered. Make sure there is an Inference provider serving this model"
So I have spun up two instances of vLLM on localhost, on ports 8000 and 8001. In the following issue, two inference providers were configured, but with ollama: https://github.com/meta-llama/llama-stack-apps/issues/89#issuecomment-2415187248
I have tried to replicate that with the following run.yaml configuration:
```yaml
providers:
  inference:
    - provider_id: inf::vllm
      provider_type: remote::vllm
      config:
        url: http://127.0.0.1:8000/v1
        api_token: fake
        model: Llama-3.2-1B
    - provider_id: safety::vllm
      provider_type: remote::vllm
      config:
        url: http://127.0.0.1:8001/v1
        api_token: fake
        model: Llama-Guard-3-1B
```
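As a quick sanity check, the two vLLM endpoints above can be probed through their OpenAI-compatible /v1/models route to confirm they are up and to see which model IDs they actually serve. A rough sketch (the URLs and the "fake" token are the ones from the config above):

```python
import requests

# Illustrative sanity check: query the OpenAI-compatible /v1/models endpoint
# on both vLLM servers to confirm they respond and to list the model IDs
# they serve (these must match what the stack expects to route to).
for base_url in ("http://127.0.0.1:8000/v1", "http://127.0.0.1:8001/v1"):
    resp = requests.get(f"{base_url}/models", headers={"Authorization": "Bearer fake"})
    resp.raise_for_status()
    print(base_url, "->", [m["id"] for m in resp.json()["data"]])
```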
Then, from a terminal on the same system, I run:
```
python -m llama_stack.apis.safety.client localhost 5000
```
Both the server and the client produce a traceback. The client-side traceback is just a consequence of the server-side one; it only reports a 500 Internal Server Error. The Guard server shows only a single connection message, and the Python client above never got far enough to actually connect to it.
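For what it's worth, the client call boils down to a POST against the stack's /safety/run_shield endpoint (the path shows up in the server log below). Here is a rough sketch of hitting it directly with requests; note that the body fields shield_type and messages are guesses inferred from the server traceback, not a verified schema:

```python
import requests

# Rough sketch of the failing call, made directly against the llama-stack server.
# Assumptions: the endpoint path comes from the server log; the body fields
# ("shield_type", "messages") and the "llama_guard" identifier are guesses based
# on the traceback and may not match the actual request schema.
resp = requests.post(
    "http://localhost:5000/safety/run_shield",
    json={
        "shield_type": "llama_guard",
        "messages": [{"role": "user", "content": "Tell me how to pick a lock."}],
    },
)
print(resp.status_code, resp.text)  # currently: 500 Internal Server Error
```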
- Is this the right recipe for connecting to 2 instances, one for inference and the other for safety?
- Does this need a routing table?
Error logs
From the Guard server:
```
INFO:     192.168.1.8:33614 - "GET /v1/models HTTP/1.1" 200 OK
```
Nothing else but metrics messages.
This is from the container running llama-stack:
```
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/llama_stack/distribution/server/server.py", line 240, in endpoint
    return await maybe_await(value)
  File "/usr/local/lib/python3.10/site-packages/llama_stack/distribution/server/server.py", line 200, in maybe_await
    return await value
  File "/usr/local/lib/python3.10/site-packages/llama_stack/distribution/routers/routers.py", line 161, in run_shield
    return await self.routing_table.get_provider_impl(shield_type).run_shield(
  File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/impls/meta_reference/safety/safety.py", line 77, in run_shield
    res = await shield.run(messages)
  File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/impls/meta_reference/safety/llama_guard.py", line 187, in run
    async for chunk in await self.inference_api.chat_completion(
  File "/usr/local/lib/python3.10/site-packages/llama_stack/distribution/routers/routers.py", line 101, in chat_completion
    return (chunk async for chunk in await provider.chat_completion(**params))
TypeError: object async_generator can't be used in 'await' expression
INFO:     ::1:36496 - "POST /safety/run_shield HTTP/1.1" 500 Internal Server Error
```
Expected behavior
I was hoping it might work as shown here: https://github.com/meta-llama/llama-stack-apps
```
    async for chunk in await self.inference_api.chat_completion(
  File "/usr/local/lib/python3.10/site-packages/llama_stack/distribution/routers/routers.py", line 101, in chat_completion
    return (chunk async for chunk in await provider.chat_completion(**params))
```
This looks like a bug. Let me check quickly.
Found it
https://github.com/meta-llama/llama-stack/blob/adecb2a2d3bc5b5fb12280c54096706974e58201/llama_stack/providers/adapters/inference/vllm/vllm.py#L89
needs to be async def chat_completion
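For context, here is a minimal sketch of the failure mode, independent of the actual adapter code: the router does `async for chunk in await provider.chat_completion(**params)`, so the method has to be a coroutine that, once awaited, hands back an async iterator. A plain `def` that returns an async generator object directly can't be awaited, which is exactly the TypeError above. The function names below are hypothetical:

```python
import asyncio

def broken_chat_completion(**params):
    # A plain `def` that returns an async generator object directly.
    # Calling it gives back an async_generator, and awaiting that raises
    # "TypeError: object async_generator can't be used in 'await' expression".
    async def stream():
        yield "chunk"
    return stream()

async def fixed_chat_completion(**params):
    # An `async def` coroutine: awaiting the call returns the async iterator,
    # which matches `async for chunk in await provider.chat_completion(...)`.
    async def stream():
        yield "chunk"
    return stream()

async def main():
    try:
        await broken_chat_completion()
    except TypeError as e:
        print("broken:", e)
    async for chunk in await fixed_chat_completion():
        print("fixed:", chunk)

asyncio.run(main())
```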
Fixed it. See https://github.com/meta-llama/llama-stack/commit/bf4f97a2e190e41cddb96ad9cb1bf4fde5d673fb
@stevegrubb would you need an updated container, or are you able to rebuild a container locally with the updated commits (using LLAMA_STACK_DIR=... etc.)?
Anyhow, here are the instructions if you want to rebuild the container using `llama stack build`:
```
cd <your_repo_root>
git pull --rebase
LLAMA_STACK_DIR=<your_repo_root> llama stack build --config <appropriate_build_config> --image-type docker
```
Let me know if you need more help figuring out what the appropriate build config should be.
Unrelatedly, we will make sure we update our documentation to cover these aspects.
Thanks. I'll test this tomorrow.
OK, I was able to test that. It fixed the immediate problem. With that fixed, we run into this:
```
  File "/usr/local/lib/python3.10/site-packages/llama_stack/distribution/routers/routers.py", line 101, in
```
I fixed this by deleting the "request" variable on line 136 to match the other inference providers.
This can be closed out. The 2 issues discussed here have patches that are merged. Thanks.