Add a remote-vllm integration test to GitHub Actions workflow.
🚀 Describe the new functionality needed
Given that vLLM has been a very popular choice as an inference solution, I would like to suggest we add a remote-vllm integration test to the GitHub Actions workflow. Testing the CPU version of vLLM on a 1B/3B model is probably enough, similar to the PR that added the Ollama test.
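To make this concrete, here is a rough sketch of the commands such a job might run. The 1B model choice, the ports, and how the remote-vllm llama-stack server itself gets launched are assumptions, not a finished workflow:

```bash
# Rough sketch only -- model, ports, and stack startup are assumptions.

# 1. Serve a small model with CPU-only vLLM in the background.
uv run --with vllm --python 3.12 \
  vllm serve meta-llama/Llama-3.2-1B-Instruct --port 8000 &

# 2. Wait for the OpenAI-compatible endpoint to come up.
until curl -sf http://localhost:8000/v1/models > /dev/null; do
  sleep 5
done

# 3. Run the text-inference integration tests against a llama-stack server
#    configured with the remote-vllm provider pointing at localhost:8000
#    (how that stack server is launched is left out of this sketch).
pytest -s -v inference/test_text_inference.py \
  --stack-config http://localhost:5000 \
  --text-model meta-llama/Llama-3.2-1B-Instruct
```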
💡 Why is this needed? What if we don't build it?
The vLLM provider may be broken without us noticing, and many users/companies would not be able to use llama-stack with vLLM.
Other thoughts
This will add some inference costs, but I believe making sure the vLLM provider works well with llama-stack is very important.
Here's how you can run vLLM easily enough:
uv run --with vllm --python 3.12 vllm serve meta-llama/Llama-3.2-3B-Instruct
This probably needs a Hugging Face token with permission to read the gated Llama repository, though :/
Couldn't you just use a non-Llama model that doesn't require a HuggingFace token? Or are only Llama models supported with the vLLM provider?
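For illustration, here is what that could look like with a small ungated model (Qwen/Qwen2.5-0.5B-Instruct is just one example of a model that doesn't sit behind a license gate, not a recommendation from this thread):

```bash
# No HF token needed here, assuming the chosen model is not gated.
uv run --with vllm --python 3.12 vllm serve Qwen/Qwen2.5-0.5B-Instruct
```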
Do we have a way to store secrets in the GitHub Action? I also wonder how we are testing the meta-reference server, as it needs some credentials to get our PyTorch weights too.
It would be great if vLLM could be added to the integration tests. Right now, from my tests, everything related to tool calling is not working.
Command to launch the vLLM inference engine:
docker run -d --rm \
  --name llamastk_vllm \
  --runtime nvidia \
  --shm-size 1g \
  -p $INFERENCE_PORT:$INFERENCE_PORT \
  --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN" \
  --ipc=host vllm/vllm-openai:latest \
  --gpu-memory-utilization 0.9 \
  --model $INFERENCE_MODEL \
  --tensor-parallel-size 1 \
  --port 80
Pytest command (assuming the llama-stack host is running on port 5000):
pytest -s -v inference/test_text_inference.py --stack-config http://localhost:5000 --text-model meta-llama/Llama-3.1-8B-Instruct
Test results showing tool-calling errors:
inference/test_text_inference.py::test_text_completion_non_streaming[txt=8B-inference:completion:sanity] PASSED
inference/test_text_inference.py::test_text_completion_streaming[txt=8B-inference:completion:sanity] PASSED
inference/test_text_inference.py::test_text_completion_log_probs_non_streaming[txt=8B-inference:completion:log_probs] PASSED
inference/test_text_inference.py::test_text_completion_log_probs_streaming[txt=8B-inference:completion:log_probs] PASSED
inference/test_text_inference.py::test_text_completion_structured_output[txt=8B-inference:completion:structured_output] PASSED
inference/test_text_inference.py::test_text_chat_completion_non_streaming[txt=8B-inference:chat_completion:non_streaming_01] PASSED
inference/test_text_inference.py::test_text_chat_completion_non_streaming[txt=8B-inference:chat_completion:non_streaming_02] PASSED
inference/test_text_inference.py::test_text_chat_completion_first_token_profiling[txt=8B-inference:chat_completion:ttft] PASSED
inference/test_text_inference.py::test_text_chat_completion_streaming[txt=8B-inference:chat_completion:streaming_01] PASSED
inference/test_text_inference.py::test_text_chat_completion_streaming[txt=8B-inference:chat_completion:streaming_02] PASSED
inference/test_text_inference.py::test_text_chat_completion_with_tool_calling_and_non_streaming[txt=8B-inference:chat_completion:tool_calling] FAILED
inference/test_text_inference.py::test_text_chat_completion_with_tool_calling_and_streaming[txt=8B-inference:chat_completion:tool_calling] FAILED
inference/test_text_inference.py::test_text_chat_completion_with_tool_choice_required[txt=8B-inference:chat_completion:tool_calling] FAILED
inference/test_text_inference.py::test_text_chat_completion_with_tool_choice_none[txt=8B-inference:chat_completion:tool_calling] PASSED
inference/test_text_inference.py::test_text_chat_completion_structured_output[txt=8B-inference:chat_completion:structured_output] FAILED
inference/test_text_inference.py::test_text_chat_completion_tool_calling_tools_not_in_request[txt=8B-inference:chat_completion:tool_calling_tools_absent-True] FAILED
inference/test_text_inference.py::test_text_chat_completion_tool_calling_tools_not_in_request[txt=8B-inference:chat_completion:tool_calling_tools_absent-False] FAILED
Can you paste the errors?
Host side:
ERROR 2025-03-18 19:56:53,582 __main__:195 server: Error executing endpoint route='/v1/inference/chat-completion'
method='post'
╭───────────────────────────────────── Traceback (most recent call last) ─────────────────────────────────────╮
│ /usr/local/lib/python3.10/site-packages/llama_stack/distribution/server/server.py:193 in endpoint │
│ │
│ 190 │ │ │ │ │ return StreamingResponse(gen, media_type="text/event-stream") │
│ 191 │ │ │ │ else: │
│ 192 │ │ │ │ │ value = func(**kwargs) │
│ ❱ 193 │ │ │ │ │ return await maybe_await(value) │
│ 194 │ │ │ except Exception as e: │
│ 195 │ │ │ │ logger.exception(f"Error executing endpoint {route=} {method=}") │
│ 196 │ │ │ │ raise translate_exception(e) from e │
│ │
│ /usr/local/lib/python3.10/site-packages/llama_stack/distribution/server/server.py:156 in maybe_await │
│ │
│ 153 │
│ 154 async def maybe_await(value): │
│ 155 │ if inspect.iscoroutine(value): │
│ ❱ 156 │ │ return await value │
│ 157 │ return value │
│ 158 │
│ 159 │
│ │
│ /usr/local/lib/python3.10/site-packages/llama_stack/providers/utils/telemetry/trace_protocol.py:102 in │
│ async_wrapper │
│ │
│ 99 │ │ │ │
│ 100 │ │ │ with tracing.span(f"{class_name}.{method_name}", span_attributes) as span: │
│ 101 │ │ │ │ try: │
│ ❱ 102 │ │ │ │ │ result = await method(self, *args, **kwargs) │
│ 103 │ │ │ │ │ span.set_attribute("output", serialize_value(result)) │
│ 104 │ │ │ │ │ return result │
│ 105 │ │ │ │ except Exception as e: │
│ │
│ /usr/local/lib/python3.10/site-packages/llama_stack/distribution/routers/routers.py:316 in chat_completion │
│ │
│ 313 │ │ │ │
│ 314 │ │ │ return stream_generator() │
│ 315 │ │ else: │
│ ❱ 316 │ │ │ response = await provider.chat_completion(**params) │
│ 317 │ │ │ completion_tokens = await self._count_tokens( │
│ 318 │ │ │ │ [response.completion_message], │
│ 319 │ │ │ │ tool_config.tool_prompt_format, │
│ │
│ /usr/local/lib/python3.10/site-packages/llama_stack/providers/utils/telemetry/trace_protocol.py:102 in │
│ async_wrapper │
│ │
│ 99 │ │ │ │
│ 100 │ │ │ with tracing.span(f"{class_name}.{method_name}", span_attributes) as span: │
│ 101 │ │ │ │ try: │
│ ❱ 102 │ │ │ │ │ result = await method(self, *args, **kwargs) │
│ 103 │ │ │ │ │ span.set_attribute("output", serialize_value(result)) │
│ 104 │ │ │ │ │ return result │
│ 105 │ │ │ │ except Exception as e: │
│ │
│ /usr/local/lib/python3.10/site-packages/llama_stack/providers/remote/inference/vllm/vllm.py:300 in │
│ chat_completion │
│ │
│ 297 │ │ if stream: │
│ 298 │ │ │ return self._stream_chat_completion(request, self.client) │
│ 299 │ │ else: │
│ ❱ 300 │ │ │ return await self._nonstream_chat_completion(request, self.client) │
│ 301 │ │
│ 302 │ async def _nonstream_chat_completion( │
│ 303 │ │ self, request: ChatCompletionRequest, client: AsyncOpenAI │
│ │
│ /usr/local/lib/python3.10/site-packages/llama_stack/providers/remote/inference/vllm/vllm.py:306 in │
│ _nonstream_chat_completion │
│ │
│ 303 │ │ self, request: ChatCompletionRequest, client: AsyncOpenAI │
│ 304 │ ) -> ChatCompletionResponse: │
│ 305 │ │ params = await self._get_params(request) │
│ ❱ 306 │ │ r = await client.chat.completions.create(**params) │
│ 307 │ │ choice = r.choices[0] │
│ 308 │ │ result = ChatCompletionResponse( │
│ 309 │ │ │ completion_message=CompletionMessage( │
│ │
│ /usr/local/lib/python3.10/site-packages/openai/resources/chat/completions/completions.py:2000 in create │
│ │
│ 1997 │ │ timeout: float | httpx.Timeout | None | NotGiven = NOT_GIVEN, │
│ 1998 │ ) -> ChatCompletion | AsyncStream[ChatCompletionChunk]: │
│ 1999 │ │ validate_response_format(response_format) │
│ ❱ 2000 │ │ return await self._post( │
│ 2001 │ │ │ "/chat/completions", │
│ 2002 │ │ │ body=await async_maybe_transform( │
│ 2003 │ │ │ │ { │
│ │
│ /usr/local/lib/python3.10/site-packages/openai/_base_client.py:1767 in post │
│ │
│ 1764 │ │ opts = FinalRequestOptions.construct( │
│ 1765 │ │ │ method="post", url=path, json_data=body, files=await │
│ async_to_httpx_files(files), **options │
│ 1766 │ │ ) │
│ ❱ 1767 │ │ return await self.request(cast_to, opts, stream=stream, stream_cls=stream_cls) │
│ 1768 │ │
│ 1769 │ async def patch( │
│ 1770 │ │ self, │
│ │
│ /usr/local/lib/python3.10/site-packages/openai/_base_client.py:1461 in request │
│ │
│ 1458 │ │ else: │
│ 1459 │ │ │ retries_taken = 0 │
│ 1460 │ │ │
│ ❱ 1461 │ │ return await self._request( │
│ 1462 │ │ │ cast_to=cast_to, │
│ 1463 │ │ │ options=options, │
│ 1464 │ │ │ stream=stream, │
│ │
│ /usr/local/lib/python3.10/site-packages/openai/_base_client.py:1562 in _request │
│ │
│ 1559 │ │ │ │ await err.response.aread() │
│ 1560 │ │ │ │
│ 1561 │ │ │ log.debug("Re-raising status error") │
│ ❱ 1562 │ │ │ raise self._make_status_error_from_response(err.response) from None │
│ 1563 │ │ │
│ 1564 │ │ return await self._process_response( │
│ 1565 │ │ │ cast_to=cast_to, │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
BadRequestError: Error code: 400 - {'object': 'error', 'message': '"auto" tool choice requires
--enable-auto-tool-choice and --tool-call-parser to be set', 'type': 'BadRequestError', 'param': None, 'code':
400}
INFO: 172.17.0.1:33550 - "POST /v1/inference/chat-completion HTTP/1.1" 500 Internal Server Error
19:56:54.264 [END] /v1/inference/chat-completion [StatusCode.OK] (684.31ms)
19:56:54.262 [ERROR] Error executing endpoint route='/v1/inference/chat-completion' method='post'
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/llama_stack/distribution/server/server.py", line 193, in endpoint
return await maybe_await(value)
File "/usr/local/lib/python3.10/site-packages/llama_stack/distribution/server/server.py", line 156, in maybe_await
return await value
File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/utils/telemetry/trace_protocol.py", line 102, in async_wrapper
result = await method(self, *args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/llama_stack/distribution/routers/routers.py", line 316, in chat_completion
response = await provider.chat_completion(**params)
File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/utils/telemetry/trace_protocol.py", line 102, in async_wrapper
result = await method(self, *args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/remote/inference/vllm/vllm.py", line 300, in chat_completion
return await self._nonstream_chat_completion(request, self.client)
File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/remote/inference/vllm/vllm.py", line 306, in _nonstream_chat_completion
r = await client.chat.completions.create(**params)
File "/usr/local/lib/python3.10/site-packages/openai/resources/chat/completions/completions.py", line 2000, in create
return await self._post(
File "/usr/local/lib/python3.10/site-packages/openai/_base_client.py", line 1767, in post
return await self.request(cast_to, opts, stream=stream, stream_cls=stream_cls)
File "/usr/local/lib/python3.10/site-packages/openai/_base_client.py", line 1461, in request
return await self._request(
File "/usr/local/lib/python3.10/site-packages/openai/_base_client.py", line 1562, in _request
raise self._make_status_error_from_response(err.response) from None
openai.BadRequestError: Error code: 400 - {'object': 'error', 'message': '"auto" tool choice requires --enable-auto-tool-choice and --tool-call-parser to be set', 'type': 'BadRequestError', 'param': None, 'code': 400}
19:56:54.689 [START] /v1/inference/chat-completion
ERROR 2025-03-18 19:56:54,709 __main__:195 server: Error executing endpoint route='/v1/inference/chat-completion'
method='post'
╭───────────────────────────────────── Traceback (most recent call last) ─────────────────────────────────────╮
│ /usr/local/lib/python3.10/site-packages/llama_stack/distribution/server/server.py:193 in endpoint │
│ │
│ 190 │ │ │ │ │ return StreamingResponse(gen, media_type="text/event-stream") │
│ 191 │ │ │ │ else: │
│ 192 │ │ │ │ │ value = func(**kwargs) │
│ ❱ 193 │ │ │ │ │ return await maybe_await(value) │
│ 194 │ │ │ except Exception as e: │
│ 195 │ │ │ │ logger.exception(f"Error executing endpoint {route=} {method=}") │
│ 196 │ │ │ │ raise translate_exception(e) from e │
│ │
│ /usr/local/lib/python3.10/site-packages/llama_stack/distribution/server/server.py:156 in maybe_await │
│ │
│ 153 │
│ 154 async def maybe_await(value): │
│ 155 │ if inspect.iscoroutine(value): │
│ ❱ 156 │ │ return await value │
│ 157 │ return value │
│ 158 │
│ 159 │
│ │
│ /usr/local/lib/python3.10/site-packages/llama_stack/providers/utils/telemetry/trace_protocol.py:102 in │
│ async_wrapper │
│ │
│ 99 │ │ │ │
│ 100 │ │ │ with tracing.span(f"{class_name}.{method_name}", span_attributes) as span: │
│ 101 │ │ │ │ try: │
│ ❱ 102 │ │ │ │ │ result = await method(self, *args, **kwargs) │
│ 103 │ │ │ │ │ span.set_attribute("output", serialize_value(result)) │
│ 104 │ │ │ │ │ return result │
│ 105 │ │ │ │ except Exception as e: │
│ │
│ /usr/local/lib/python3.10/site-packages/llama_stack/distribution/routers/routers.py:316 in chat_completion │
│ │
│ 313 │ │ │ │
│ 314 │ │ │ return stream_generator() │
│ 315 │ │ else: │
│ ❱ 316 │ │ │ response = await provider.chat_completion(**params) │
│ 317 │ │ │ completion_tokens = await self._count_tokens( │
│ 318 │ │ │ │ [response.completion_message], │
│ 319 │ │ │ │ tool_config.tool_prompt_format, │
│ │
│ /usr/local/lib/python3.10/site-packages/llama_stack/providers/utils/telemetry/trace_protocol.py:102 in │
│ async_wrapper │
│ │
│ 99 │ │ │ │
│ 100 │ │ │ with tracing.span(f"{class_name}.{method_name}", span_attributes) as span: │
│ 101 │ │ │ │ try: │
│ ❱ 102 │ │ │ │ │ result = await method(self, *args, **kwargs) │
│ 103 │ │ │ │ │ span.set_attribute("output", serialize_value(result)) │
│ 104 │ │ │ │ │ return result │
│ 105 │ │ │ │ except Exception as e: │
│ │
│ /usr/local/lib/python3.10/site-packages/llama_stack/providers/remote/inference/vllm/vllm.py:300 in │
│ chat_completion │
│ │
│ 297 │ │ if stream: │
│ 298 │ │ │ return self._stream_chat_completion(request, self.client) │
│ 299 │ │ else: │
│ ❱ 300 │ │ │ return await self._nonstream_chat_completion(request, self.client) │
│ 301 │ │
│ 302 │ async def _nonstream_chat_completion( │
│ 303 │ │ self, request: ChatCompletionRequest, client: AsyncOpenAI │
│ │
│ /usr/local/lib/python3.10/site-packages/llama_stack/providers/remote/inference/vllm/vllm.py:306 in │
│ _nonstream_chat_completion │
│ │
│ 303 │ │ self, request: ChatCompletionRequest, client: AsyncOpenAI │
│ 304 │ ) -> ChatCompletionResponse: │
│ 305 │ │ params = await self._get_params(request) │
│ ❱ 306 │ │ r = await client.chat.completions.create(**params) │
│ 307 │ │ choice = r.choices[0] │
│ 308 │ │ result = ChatCompletionResponse( │
│ 309 │ │ │ completion_message=CompletionMessage( │
│ │
│ /usr/local/lib/python3.10/site-packages/openai/resources/chat/completions/completions.py:2000 in create │
│ │
│ 1997 │ │ timeout: float | httpx.Timeout | None | NotGiven = NOT_GIVEN, │
│ 1998 │ ) -> ChatCompletion | AsyncStream[ChatCompletionChunk]: │
│ 1999 │ │ validate_response_format(response_format) │
│ ❱ 2000 │ │ return await self._post( │
│ 2001 │ │ │ "/chat/completions", │
│ 2002 │ │ │ body=await async_maybe_transform( │
│ 2003 │ │ │ │ { │
│ │
│ /usr/local/lib/python3.10/site-packages/openai/_base_client.py:1767 in post │
│ │
│ 1764 │ │ opts = FinalRequestOptions.construct( │
│ 1765 │ │ │ method="post", url=path, json_data=body, files=await │
│ async_to_httpx_files(files), **options │
│ 1766 │ │ ) │
│ ❱ 1767 │ │ return await self.request(cast_to, opts, stream=stream, stream_cls=stream_cls) │
│ 1768 │ │
│ 1769 │ async def patch( │
│ 1770 │ │ self, │
│ │
│ /usr/local/lib/python3.10/site-packages/openai/_base_client.py:1461 in request │
│ │
│ 1458 │ │ else: │
│ 1459 │ │ │ retries_taken = 0 │
│ 1460 │ │ │
│ ❱ 1461 │ │ return await self._request( │
│ 1462 │ │ │ cast_to=cast_to, │
│ 1463 │ │ │ options=options, │
│ 1464 │ │ │ stream=stream, │
│ │
│ /usr/local/lib/python3.10/site-packages/openai/_base_client.py:1562 in _request │
│ │
│ 1559 │ │ │ │ await err.response.aread() │
│ 1560 │ │ │ │
│ 1561 │ │ │ log.debug("Re-raising status error") │
│ ❱ 1562 │ │ │ raise self._make_status_error_from_response(err.response) from None │
│ 1563 │ │ │
│ 1564 │ │ return await self._process_response( │
│ 1565 │ │ │ cast_to=cast_to, │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
BadRequestError: Error code: 400 - {'object': 'error', 'message': '"auto" tool choice requires
--enable-auto-tool-choice and --tool-call-parser to be set', 'type': 'BadRequestError', 'param': None, 'code':
400}
19:56:55.252 [END] /v1/inference/chat-completion [StatusCode.OK] (563.14ms)
19:56:55.251 [ERROR] Error executing endpoint route='/v1/inference/chat-completion' method='post'
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/llama_stack/distribution/server/server.py", line 193, in endpoint
return await maybe_await(value)
File "/usr/local/lib/python3.10/site-packages/llama_stack/distribution/server/server.py", line 156, in maybe_await
return await value
File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/utils/telemetry/trace_protocol.py", line 102, in async_wrapper
result = await method(self, *args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/llama_stack/distribution/routers/routers.py", line 316, in chat_completion
response = await provider.chat_completion(**params)
File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/utils/telemetry/trace_protocol.py", line 102, in async_wrapper
result = await method(self, *args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/remote/inference/vllm/vllm.py", line 300, in chat_completion
return await self._nonstream_chat_completion(request, self.client)
File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/remote/inference/vllm/vllm.py", line 306, in _nonstream_chat_completion
r = await client.chat.completions.create(**params)
File "/usr/local/lib/python3.10/site-packages/openai/resources/chat/completions/completions.py", line 2000, in create
return await self._post(
File "/usr/local/lib/python3.10/site-packages/openai/_base_client.py", line 1767, in post
return await self.request(cast_to, opts, stream=stream, stream_cls=stream_cls)
File "/usr/local/lib/python3.10/site-packages/openai/_base_client.py", line 1461, in request
return await self._request(
File "/usr/local/lib/python3.10/site-packages/openai/_base_client.py", line 1562, in _request
raise self._make_status_error_from_response(err.response) from None
openai.BadRequestError: Error code: 400 - {'object': 'error', 'message': '"auto" tool choice requires --enable-auto-tool-choice and --tool-call-parser to be set', 'type': 'BadRequestError', 'param': None, 'code': 400}
...
Client side:
================================================================================================================= FAILURES ==================================================================================================================
_______________________________________________________________ test_text_chat_completion_with_tool_calling_and_non_streaming[txt=8B-inference:chat_completion:tool_calling] ________________________________________________________________
inference/test_text_inference.py:292: in test_text_chat_completion_with_tool_calling_and_non_streaming
response = client_with_models.inference.chat_completion(
../../../env/lib/python3.10/site-packages/llama_stack_client/_utils/_utils.py:275: in wrapper
return func(*args, **kwargs)
../../../env/lib/python3.10/site-packages/llama_stack_client/resources/inference.py:291: in chat_completion
return self._post(
../../../env/lib/python3.10/site-packages/llama_stack_client/_base_client.py:1225: in post
return cast(ResponseT, self.request(cast_to, opts, stream=stream, stream_cls=stream_cls))
../../../env/lib/python3.10/site-packages/llama_stack_client/_base_client.py:917: in request
return self._request(
../../../env/lib/python3.10/site-packages/llama_stack_client/_base_client.py:1005: in _request
return self._retry_request(
../../../env/lib/python3.10/site-packages/llama_stack_client/_base_client.py:1054: in _retry_request
return self._request(
../../../env/lib/python3.10/site-packages/llama_stack_client/_base_client.py:1005: in _request
return self._retry_request(
../../../env/lib/python3.10/site-packages/llama_stack_client/_base_client.py:1054: in _retry_request
return self._request(
../../../env/lib/python3.10/site-packages/llama_stack_client/_base_client.py:1020: in _request
raise self._make_status_error_from_response(err.response) from None
E llama_stack_client.InternalServerError: Error code: 500 - {'detail': 'Internal server error: An unexpected error occurred.'}
_________________________________________________________________ test_text_chat_completion_with_tool_calling_and_streaming[txt=8B-inference:chat_completion:tool_calling] __________________________________________________________________
inference/test_text_inference.py:336: in test_text_chat_completion_with_tool_calling_and_streaming
tool_invocation_content = extract_tool_invocation_content(response)
inference/test_text_inference.py:313: in extract_tool_invocation_content
delta = chunk.event.delta
E AttributeError: 'NoneType' object has no attribute 'delta'
____________________________________________________________________ test_text_chat_completion_with_tool_choice_required[txt=8B-inference:chat_completion:tool_calling] _____________________________________________________________________
inference/test_text_inference.py:360: in test_text_chat_completion_with_tool_choice_required
tool_invocation_content = extract_tool_invocation_content(response)
inference/test_text_inference.py:313: in extract_tool_invocation_content
delta = chunk.event.delta
E AttributeError: 'NoneType' object has no attribute 'delta'
______________________________________________________________________ test_text_chat_completion_structured_output[txt=8B-inference:chat_completion:structured_output] ______________________________________________________________________
inference/test_text_inference.py:414: in test_text_chat_completion_structured_output
answer = AnswerFormat.model_validate_json(response.completion_message.content)
E pydantic_core._pydantic_core.ValidationError: 1 validation error for AnswerFormat
E Invalid JSON: EOF while parsing an object at line 8191 column 0 [type=json_invalid, input_value='{ \n\n\n\n\n\n \n\n\...\n\n\n\n \n\n\n\n\n\n', input_type=str]
E For further information visit https://errors.pydantic.dev/2.10/v/json_invalid
_______________________________________________________ test_text_chat_completion_tool_calling_tools_not_in_request[txt=8B-inference:chat_completion:tool_calling_tools_absent-True] ________________________________________________________
inference/test_text_inference.py:450: in test_text_chat_completion_tool_calling_tools_not_in_request
delta = chunk.event.delta
E AttributeError: 'NoneType' object has no attribute 'delta'
_______________________________________________________ test_text_chat_completion_tool_calling_tools_not_in_request[txt=8B-inference:chat_completion:tool_calling_tools_absent-False] _______________________________________________________
inference/test_text_inference.py:446: in test_text_chat_completion_tool_calling_tools_not_in_request
response = client_with_models.inference.chat_completion(**request)
../../../env/lib/python3.10/site-packages/llama_stack_client/_utils/_utils.py:275: in wrapper
return func(*args, **kwargs)
../../../env/lib/python3.10/site-packages/llama_stack_client/resources/inference.py:291: in chat_completion
return self._post(
../../../env/lib/python3.10/site-packages/llama_stack_client/_base_client.py:1225: in post
return cast(ResponseT, self.request(cast_to, opts, stream=stream, stream_cls=stream_cls))
../../../env/lib/python3.10/site-packages/llama_stack_client/_base_client.py:917: in request
return self._request(
../../../env/lib/python3.10/site-packages/llama_stack_client/_base_client.py:1005: in _request
return self._retry_request(
../../../env/lib/python3.10/site-packages/llama_stack_client/_base_client.py:1054: in _retry_request
return self._request(
../../../env/lib/python3.10/site-packages/llama_stack_client/_base_client.py:1005: in _request
return self._retry_request(
../../../env/lib/python3.10/site-packages/llama_stack_client/_base_client.py:1054: in _retry_request
return self._request(
../../../env/lib/python3.10/site-packages/llama_stack_client/_base_client.py:1020: in _request
raise self._make_status_error_from_response(err.response) from None
E llama_stack_client.InternalServerError: Error code: 500 - {'detail': 'Internal server error: An unexpected error occurred.'}
========================================================================================================== short test summary info ==========================================================================================================
FAILED inference/test_text_inference.py::test_text_chat_completion_with_tool_calling_and_non_streaming[txt=8B-inference:chat_completion:tool_calling] - llama_stack_client.InternalServerError: Error code: 500 - {'detail': 'Internal server
error: An unexpected error occurred.'}
FAILED inference/test_text_inference.py::test_text_chat_completion_with_tool_calling_and_streaming[txt=8B-inference:chat_completion:tool_calling] - AttributeError: 'NoneType' object has no attribute 'delta'
FAILED inference/test_text_inference.py::test_text_chat_completion_with_tool_choice_required[txt=8B-inference:chat_completion:tool_calling] - AttributeError: 'NoneType' object has no attribute 'delta'
FAILED inference/test_text_inference.py::test_text_chat_completion_structured_output[txt=8B-inference:chat_completion:structured_output] - pydantic_core._pydantic_core.ValidationError: 1 validation error for AnswerFormat
FAILED inference/test_text_inference.py::test_text_chat_completion_tool_calling_tools_not_in_request[txt=8B-inference:chat_completion:tool_calling_tools_absent-True] - AttributeError: 'NoneType' object has no attribute 'delta'
FAILED inference/test_text_inference.py::test_text_chat_completion_tool_calling_tools_not_in_request[txt=8B-inference:chat_completion:tool_calling_tools_absent-False] - llama_stack_client.InternalServerError: Error code: 500 - {'detail':
'Internal server error: An unexpected error occurred.'}
It is probably related to how tool use is enabled for the vLLM Docker container.
Yes, you need to enable tool calling, as instructed by the error message.
Thanks. I could not find any documentation on enabling tool calling when running vLLM as a container. Do you have any instructions that might help?
@dawenxi-007 See https://docs.vllm.ai/en/latest/features/tool_calling.html. I am adding the link to the docs: https://github.com/meta-llama/llama-stack/pull/1719
Thanks, but the link does not give information about tool calling for deployments with Docker, which is what I am looking for. The current documentation from vLLM and from Llama Stack does not describe it either.
What's not clear? Maybe I am missing something here. You can add additional args (relevant to tool calling) to your container entrypoint.
Yes, it turned out I needed all three of the following options enabled (there are some dependencies between them); a combined launch command is sketched after the flags:
--enable-auto-tool-choice \
--tool-call-parser llama3_json \
--chat-template examples/tool_chat_template_llama3.1_json.jinja
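Putting those flags together with the earlier docker run command gives a launch roughly like the following. This is a sketch; it assumes the Llama 3.1 JSON tool-call chat template shipped with vLLM is available at that path inside (or mounted into) the container:

```bash
docker run -d --rm \
  --name llamastk_vllm \
  --runtime nvidia \
  --shm-size 1g \
  -p $INFERENCE_PORT:$INFERENCE_PORT \
  --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN" \
  --ipc=host vllm/vllm-openai:latest \
  --gpu-memory-utilization 0.9 \
  --model $INFERENCE_MODEL \
  --tensor-parallel-size 1 \
  --port 80 \
  --enable-auto-tool-choice \
  --tool-call-parser llama3_json \
  --chat-template examples/tool_chat_template_llama3.1_json.jinja
```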
Now, for text inference testing (I haven't verified vision yet), all other tests pass except for structured output.
______________________________ test_text_chat_completion_structured_output[txt=8B-inference:chat_completion:structured_output] ______________________________
inference/test_text_inference.py:414: in test_text_chat_completion_structured_output
answer = AnswerFormat.model_validate_json(response.completion_message.content)
E pydantic_core._pydantic_core.ValidationError: 1 validation error for AnswerFormat
E Invalid JSON: EOF while parsing an object at line 8191 column 0 [type=json_invalid, input_value='{ \n\n\n\n\n\n \n\n\...\n\n\n\n \n\n\n\n\n\n', i
nput_type=str]
E For further information visit https://errors.pydantic.dev/2.10/v/json_invalid
================================================================== short test summary info ==================================================================
FAILED inference/test_text_inference.py::test_text_chat_completion_structured_output[txt=8B-inference:chat_completion:structured_output] - pydantic_core._pydantic_core.ValidationError: 1 validation error for AnswerFormat
==================================================== 1 failed, 16 passed, 2 warnings in 69.94s (0:01:09) ====================================================
This issue has been automatically marked as stale because it has not had activity within 60 days. It will be automatically closed if no further activity occurs within 30 days.
This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant!
OK, I have a somewhat working branch here, https://github.com/leseb/llama-stack/pull/7, with some tests still failing. I think we could have a nightly job for vLLM only.
Some highlights:
- We run vLLM in CPU mode
- We can only load a 1B model
- Somehow the vLLM CPU container builds are not working, so I built my own image from a runner and baked the 1B model into it. Using that image works (see the build sketch after this list)
- It's flaky from time to time, but that seems to be due to runner errors
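For reference, a minimal sketch of building a CPU-only vLLM image from the vLLM sources; the Dockerfile location has moved between vLLM releases, and baking the 1B model in would be an extra layer on top of this:

```bash
# Sketch only: build a CPU-only vLLM image from the vLLM sources.
# The Dockerfile path below is an assumption and differs between releases.
git clone https://github.com/vllm-project/vllm.git
cd vllm
docker build -f docker/Dockerfile.cpu -t vllm-cpu-env --shm-size=4g .
```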
I think it would be best to narrow down the vLLM integration tests to the most meaningful ones.
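For example, the job could start with a narrowed selection via pytest's `-k` filter; the expression below is a hypothetical starting point based on the test names pasted earlier in this thread, reusing the same `$INFERENCE_MODEL` variable from the docker command above:

```bash
# Hypothetical narrowed selection: core chat-completion and tool-calling tests only.
pytest -s -v inference/test_text_inference.py \
  --stack-config http://localhost:5000 \
  --text-model "$INFERENCE_MODEL" \
  -k "chat_completion_non_streaming or chat_completion_streaming or tool_calling"
```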
@bbrowning @ashwinb @terrytangyuan thoughts on that?
@leseb I'd love to get regular test signals from vLLM (or other providers, but especially vLLM). I'd prefer to run the tests with a "real" model, which requires GPU. If the choice is between no vLLM testing or vLLM nightly testing with CPU, nightly with CPU is definitely better than nothing.
Perhaps a nightly with CPU is a decent stop-gap for now. I could get us set up with some nightly vLLM testing of the latest Llama Stack main in a separate GitHub org, where I have access to secrets, runners, and budget that can run real models in vLLM. I actually did this for some of the OpenAI API verification tests for quite a while - not just with vLLM but also with a number of our SaaS inference providers.
Regardless of the method (in-tree nightly on CPU vs external nightly with real GPUs), getting value from this will hinge on how we use the results. Nightly jobs aren't PR-gating, so how will we respond to and fix failures? Will there be some place we can check before cutting a release to know which providers are passing tests and which aren't?
I spoke with @derekhiggins and he is going to take a look at a nightly job, starting with CPU first and hopefully GPU one day. I believe maintainers will receive emails from GitHub if the nightly fails. We just need someone to take ownership of this CI job and fix it when it breaks. Perhaps @derekhiggins again? :)
Happy to take ownership of this one
I've added a PR to add vLLM to the current integration jobs; can you take a look when possible?