Add a remote-vllm integration test to GitHub Actions workflow.
🚀 Describe the new functionality needed
Given that vLLM has been a very popular choice as an inference solution, I would like to suggest we add a remote-vllm integration test to the GitHub Actions workflow. Testing the CPU version of vLLM on a 1B/3B model is probably enough, similar to the PR that added the Ollama test.
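To make this concrete, here is a rough sketch of the commands such a job might run. The 1B model choice, the ports, and how the remote-vllm llama-stack server itself gets launched are assumptions, not a finished workflow:

```bash
# Rough sketch only -- model, ports, and stack startup are assumptions.

# 1. Serve a small model with CPU-only vLLM in the background.
uv run --with vllm --python 3.12 \
  vllm serve meta-llama/Llama-3.2-1B-Instruct --port 8000 &

# 2. Wait for the OpenAI-compatible endpoint to come up.
until curl -sf http://localhost:8000/v1/models > /dev/null; do
  sleep 5
done

# 3. Run the text-inference integration tests against a llama-stack server
#    configured with the remote-vllm provider pointing at localhost:8000
#    (how that stack server is launched is left out of this sketch).
pytest -s -v inference/test_text_inference.py \
  --stack-config http://localhost:5000 \
  --text-model meta-llama/Llama-3.2-1B-Instruct
```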
💡 Why is this needed? What if we don't build it?
The vLLM provider may be broken without us noticing, and many users/companies would not be able to use llama-stack with vLLM.
Other thoughts
This will add some inference costs, but I believe making sure the vLLM provider works well with llama-stack is very important.
Here's how you can run vLLM easily enough:
uv run --with vllm --python 3.12 vllm serve meta-llama/Llama-3.2-3B-Instruct
This probably needs a Hugging Face token with permission to read the gated Llama repository, though :/
Couldn't you just use a non-Llama model that doesn't require a HuggingFace token? Or are only Llama models supported with the vLLM provider?
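For illustration, here is what that could look like with a small ungated model (Qwen/Qwen2.5-0.5B-Instruct is just one example of a model that doesn't sit behind a license gate, not a recommendation from this thread):

```bash
# No HF token needed here, assuming the chosen model is not gated.
uv run --with vllm --python 3.12 vllm serve Qwen/Qwen2.5-0.5B-Instruct
```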
Do we have a way to store secrets in the GitHub Action? I also wonder how we are testing the meta-reference server, as it needs some credentials to get our PyTorch weights too.
It would be great if vLLM could be added to the integration tests. Right now, from my tests, everything related to tool calling is not working.
Command to launch the vLLM inference engine:
docker run -d --rm \
  --name llamastk_vllm \
  --runtime nvidia \
  --shm-size 1g \
  -p $INFERENCE_PORT:$INFERENCE_PORT \
  --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN" \
  --ipc=host vllm/vllm-openai:latest \
  --gpu-memory-utilization 0.9 \
  --model $INFERENCE_MODEL \
  --tensor-parallel-size 1 \
  --port 80
Pytest command (assuming the llama-stack host is running on port 5000):
pytest -s -v inference/test_text_inference.py --stack-config http://localhost:5000 --text-model meta-llama/Llama-3.1-8B-Instruct
Test results showing tool-calling errors:
inference/test_text_inference.py::test_text_completion_non_streaming[txt=8B-inference:completion:sanity] PASSED
inference/test_text_inference.py::test_text_completion_streaming[txt=8B-inference:completion:sanity] PASSED
inference/test_text_inference.py::test_text_completion_log_probs_non_streaming[txt=8B-inference:completion:log_probs] PASSED
inference/test_text_inference.py::test_text_completion_log_probs_streaming[txt=8B-inference:completion:log_probs] PASSED
inference/test_text_inference.py::test_text_completion_structured_output[txt=8B-inference:completion:structured_output] PASSED
inference/test_text_inference.py::test_text_chat_completion_non_streaming[txt=8B-inference:chat_completion:non_streaming_01] PASSED
inference/test_text_inference.py::test_text_chat_completion_non_streaming[txt=8B-inference:chat_completion:non_streaming_02] PASSED
inference/test_text_inference.py::test_text_chat_completion_first_token_profiling[txt=8B-inference:chat_completion:ttft] PASSED
inference/test_text_inference.py::test_text_chat_completion_streaming[txt=8B-inference:chat_completion:streaming_01] PASSED
inference/test_text_inference.py::test_text_chat_completion_streaming[txt=8B-inference:chat_completion:streaming_02] PASSED
inference/test_text_inference.py::test_text_chat_completion_with_tool_calling_and_non_streaming[txt=8B-inference:chat_completion:tool_calling] FAILED
inference/test_text_inference.py::test_text_chat_completion_with_tool_calling_and_streaming[txt=8B-inference:chat_completion:tool_calling] FAILED
inference/test_text_inference.py::test_text_chat_completion_with_tool_choice_required[txt=8B-inference:chat_completion:tool_calling] FAILED
inference/test_text_inference.py::test_text_chat_completion_with_tool_choice_none[txt=8B-inference:chat_completion:tool_calling] PASSED
inference/test_text_inference.py::test_text_chat_completion_structured_output[txt=8B-inference:chat_completion:structured_output] FAILED
inference/test_text_inference.py::test_text_chat_completion_tool_calling_tools_not_in_request[txt=8B-inference:chat_completion:tool_calling_tools_absent-True] FAILED
inference/test_text_inference.py::test_text_chat_completion_tool_calling_tools_not_in_request[txt=8B-inference:chat_completion:tool_calling_tools_absent-False] FAILED
Can you paste the errors?
Host side:
ERROR 2025-03-18 19:56:53,582 __main__:195 server: Error executing endpoint route='/v1/inference/chat-completion'
method='post'
╭───────────────────────────────────── Traceback (most recent call last) ─────────────────────────────────────╮
│ /usr/local/lib/python3.10/site-packages/llama_stack/distribution/server/server.py:193 in endpoint │
│ │
│ 190 │ │ │ │ │ return StreamingResponse(gen, media_type="text/event-stream") │
│ 191 │ │ │ │ else: │
│ 192 │ │ │ │ │ value = func(**kwargs) │
│ ❱ 193 │ │ │ │ │ return await maybe_await(value) │
│ 194 │ │ │ except Exception as e: │
│ 195 │ │ │ │ logger.exception(f"Error executing endpoint {route=} {method=}") │
│ 196 │ │ │ │ raise translate_exception(e) from e │
│ │
│ /usr/local/lib/python3.10/site-packages/llama_stack/distribution/server/server.py:156 in maybe_await │
│ │
│ 153 │
│ 154 async def maybe_await(value): │
│ 155 │ if inspect.iscoroutine(value): │
│ ❱ 156 │ │ return await value │
│ 157 │ return value │
│ 158 │
│ 159 │
│ │
│ /usr/local/lib/python3.10/site-packages/llama_stack/providers/utils/telemetry/trace_protocol.py:102 in │
│ async_wrapper │
│ │
│ 99 │ │ │ │
│ 100 │ │ │ with tracing.span(f"{class_name}.{method_name}", span_attributes) as span: │
│ 101 │ │ │ │ try: │
│ ❱ 102 │ │ │ │ │ result = await method(self, *args, **kwargs) │
│ 103 │ │ │ │ │ span.set_attribute("output", serialize_value(result)) │
│ 104 │ │ │ │ │ return result │
│ 105 │ │ │ │ except Exception as e: │
│ │
│ /usr/local/lib/python3.10/site-packages/llama_stack/distribution/routers/routers.py:316 in chat_completion │
│ │
│ 313 │ │ │ │
│ 314 │ │ │ return stream_generator() │
│ 315 │ │ else: │
│ ❱ 316 │ │ │ response = await provider.chat_completion(**params) │
│ 317 │ │ │ completion_tokens = await self._count_tokens( │
│ 318 │ │ │ │ [response.completion_message], │
│ 319 │ │ │ │ tool_config.tool_prompt_format, │
│ │
│ /usr/local/lib/python3.10/site-packages/llama_stack/providers/utils/telemetry/trace_protocol.py:102 in │
│ async_wrapper │
│ │
│ 99 │ │ │ │
│ 100 │ │ │ with tracing.span(f"{class_name}.{method_name}", span_attributes) as span: │
│ 101 │ │ │ │ try: │
│ ❱ 102 │ │ │ │ │ result = await method(self, *args, **kwargs) │
│ 103 │ │ │ │ │ span.set_attribute("output", serialize_value(result)) │
│ 104 │ │ │ │ │ return result │
│ 105 │ │ │ │ except Exception as e: │
│ │
│ /usr/local/lib/python3.10/site-packages/llama_stack/providers/remote/inference/vllm/vllm.py:300 in │
│ chat_completion │
│ │
│ 297 │ │ if stream: │
│ 298 │ │ │ return self._stream_chat_completion(request, self.client) │
│ 299 │ │ else: │
│ ❱ 300 │ │ │ return await self._nonstream_chat_completion(request, self.client) │
│ 301 │ │
│ 302 │ async def _nonstream_chat_completion( │
│ 303 │ │ self, request: ChatCompletionRequest, client: AsyncOpenAI │
│ │
│ /usr/local/lib/python3.10/site-packages/llama_stack/providers/remote/inference/vllm/vllm.py:306 in │
│ _nonstream_chat_completion │
│ │
│ 303 │ │ self, request: ChatCompletionRequest, client: AsyncOpenAI │
│ 304 │ ) -> ChatCompletionResponse: │
│ 305 │ │ params = await self._get_params(request) │
│ ❱ 306 │ │ r = await client.chat.completions.create(**params) │
│ 307 │ │ choice = r.choices[0] │
│ 308 │ │ result = ChatCompletionResponse( │
│ 309 │ │ │ completion_message=CompletionMessage( │
│ │
│ /usr/local/lib/python3.10/site-packages/openai/resources/chat/completions/completions.py:2000 in create │
│ │
│ 1997 │ │ timeout: float | httpx.Timeout | None | NotGiven = NOT_GIVEN, │
│ 1998 │ ) -> ChatCompletion | AsyncStream[ChatCompletionChunk]: │
│ 1999 │ │ validate_response_format(response_format) │
│ ❱ 2000 │ │ return await self._post( │
│ 2001 │ │ │ "/chat/completions", │
│ 2002 │ │ │ body=await async_maybe_transform( │
│ 2003 │ │ │ │ { │
│ │
│ /usr/local/lib/python3.10/site-packages/openai/_base_client.py:1767 in post │
│ │
│ 1764 │ │ opts = FinalRequestOptions.construct( │
│ 1765 │ │ │ method="post", url=path, json_data=body, files=await │
│ async_to_httpx_files(files), **options │
│ 1766 │ │ ) │
│ ❱ 1767 │ │ return await self.request(cast_to, opts, stream=stream, stream_cls=stream_cls) │
│ 1768 │ │
│ 1769 │ async def patch( │
│ 1770 │ │ self, │
│ │
│ /usr/local/lib/python3.10/site-packages/openai/_base_client.py:1461 in request │
│ │
│ 1458 │ │ else: │
│ 1459 │ │ │ retries_taken = 0 │
│ 1460 │ │ │
│ ❱ 1461 │ │ return await self._request( │
│ 1462 │ │ │ cast_to=cast_to, │
│ 1463 │ │ │ options=options, │
│ 1464 │ │ │ stream=stream, │
│ │
│ /usr/local/lib/python3.10/site-packages/openai/_base_client.py:1562 in _request │
│ │
│ 1559 │ │ │ │ await err.response.aread() │
│ 1560 │ │ │ │
│ 1561 │ │ │ log.debug("Re-raising status error") │
│ ❱ 1562 │ │ │ raise self._make_status_error_from_response(err.response) from None │
│ 1563 │ │ │
│ 1564 │ │ return await self._process_response( │
│ 1565 │ │ │ cast_to=cast_to, │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
BadRequestError: Error code: 400 - {'object': 'error', 'message': '"auto" tool choice requires
--enable-auto-tool-choice and --tool-call-parser to be set', 'type': 'BadRequestError', 'param': None, 'code':
400}
INFO: 172.17.0.1:33550 - "POST /v1/inference/chat-completion HTTP/1.1" 500 Internal Server Error
19:56:54.264 [END] /v1/inference/chat-completion [StatusCode.OK] (684.31ms)
19:56:54.262 [ERROR] Error executing endpoint route='/v1/inference/chat-completion' method='post'
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/llama_stack/distribution/server/server.py", line 193, in endpoint
return await maybe_await(value)
File "/usr/local/lib/python3.10/site-packages/llama_stack/distribution/server/server.py", line 156, in maybe_await
return await value
File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/utils/telemetry/trace_protocol.py", line 102, in async_wrapper
result = await method(self, *args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/llama_stack/distribution/routers/routers.py", line 316, in chat_completion
response = await provider.chat_completion(**params)
File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/utils/telemetry/trace_protocol.py", line 102, in async_wrapper
result = await method(self, *args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/remote/inference/vllm/vllm.py", line 300, in chat_completion
return await self._nonstream_chat_completion(request, self.client)
File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/remote/inference/vllm/vllm.py", line 306, in _nonstream_chat_completion
r = await client.chat.completions.create(**params)
File "/usr/local/lib/python3.10/site-packages/openai/resources/chat/completions/completions.py", line 2000, in create
return await self._post(
File "/usr/local/lib/python3.10/site-packages/openai/_base_client.py", line 1767, in post
return await self.request(cast_to, opts, stream=stream, stream_cls=stream_cls)
File "/usr/local/lib/python3.10/site-packages/openai/_base_client.py", line 1461, in request
return await self._request(
File "/usr/local/lib/python3.10/site-packages/openai/_base_client.py", line 1562, in _request
raise self._make_status_error_from_response(err.response) from None
openai.BadRequestError: Error code: 400 - {'object': 'error', 'message': '"auto" tool choice requires --enable-auto-tool-choice and --tool-call-parser to be set', 'type': 'BadRequestError', 'param': None, 'code': 400}
19:56:54.689 [START] /v1/inference/chat-completion
ERROR 2025-03-18 19:56:54,709 __main__:195 server: Error executing endpoint route='/v1/inference/chat-completion'
method='post'
╭───────────────────────────────────── Traceback (most recent call last) ─────────────────────────────────────╮
│ /usr/local/lib/python3.10/site-packages/llama_stack/distribution/server/server.py:193 in endpoint │
│ │
│ 190 │ │ │ │ │ return StreamingResponse(gen, media_type="text/event-stream") │
│ 191 │ │ │ │ else: │
│ 192 │ │ │ │ │ value = func(**kwargs) │
│ ❱ 193 │ │ │ │ │ return await maybe_await(value) │
│ 194 │ │ │ except Exception as e: │
│ 195 │ │ │ │ logger.exception(f"Error executing endpoint {route=} {method=}") │
│ 196 │ │ │ │ raise translate_exception(e) from e │
│ │
│ /usr/local/lib/python3.10/site-packages/llama_stack/distribution/server/server.py:156 in maybe_await │
│ │
│ 153 │
│ 154 async def maybe_await(value): │
│ 155 │ if inspect.iscoroutine(value): │
│ ❱ 156 │ │ return await value │
│ 157 │ return value │
│ 158 │
│ 159 │
│ │
│ /usr/local/lib/python3.10/site-packages/llama_stack/providers/utils/telemetry/trace_protocol.py:102 in │
│ async_wrapper │
│ │
│ 99 │ │ │ │
│ 100 │ │ │ with tracing.span(f"{class_name}.{method_name}", span_attributes) as span: │
│ 101 │ │ │ │ try: │
│ ❱ 102 │ │ │ │ │ result = await method(self, *args, **kwargs) │
│ 103 │ │ │ │ │ span.set_attribute("output", serialize_value(result)) │
│ 104 │ │ │ │ │ return result │
│ 105 │ │ │ │ except Exception as e: │
│ │
│ /usr/local/lib/python3.10/site-packages/llama_stack/distribution/routers/routers.py:316 in chat_completion │
│ │
│ 313 │ │ │ │
│ 314 │ │ │ return stream_generator() │
│ 315 │ │ else: │
│ ❱ 316 │ │ │ response = await provider.chat_completion(**params) │
│ 317 │ │ │ completion_tokens = await self._count_tokens( │
│ 318 │ │ │ │ [response.completion_message], │
│ 319 │ │ │ │ tool_config.tool_prompt_format, │
│ │
│ /usr/local/lib/python3.10/site-packages/llama_stack/providers/utils/telemetry/trace_protocol.py:102 in │
│ async_wrapper │
│ │
│ 99 │ │ │ │
│ 100 │ │ │ with tracing.span(f"{class_name}.{method_name}", span_attributes) as span: │
│ 101 │ │ │ │ try: │
│ ❱ 102 │ │ │ │ │ result = await method(self, *args, **kwargs) │
│ 103 │ │ │ │ │ span.set_attribute("output", serialize_value(result)) │
│ 104 │ │ │ │ │ return result │
│ 105 │ │ │ │ except Exception as e: │
│ │
│ /usr/local/lib/python3.10/site-packages/llama_stack/providers/remote/inference/vllm/vllm.py:300 in │
│ chat_completion │
│ │
│ 297 │ │ if stream: │
│ 298 │ │ │ return self._stream_chat_completion(request, self.client) │
│ 299 │ │ else: │
│ ❱ 300 │ │ │ return await self._nonstream_chat_completion(request, self.client) │
│ 301 │ │
│ 302 │ async def _nonstream_chat_completion( │
│ 303 │ │ self, request: ChatCompletionRequest, client: AsyncOpenAI │
│ │
│ /usr/local/lib/python3.10/site-packages/llama_stack/providers/remote/inference/vllm/vllm.py:306 in │
│ _nonstream_chat_completion │
│ │
│ 303 │ │ self, request: ChatCompletionRequest, client: AsyncOpenAI │
│ 304 │ ) -> ChatCompletionResponse: │
│ 305 │ │ params = await self._get_params(request) │
│ ❱ 306 │ │ r = await client.chat.completions.create(**params) │
│ 307 │ │ choice = r.choices[0] │
│ 308 │ │ result = ChatCompletionResponse( │
│ 309 │ │ │ completion_message=CompletionMessage( │
│ │
│ /usr/local/lib/python3.10/site-packages/openai/resources/chat/completions/completions.py:2000 in create │
│ │
│ 1997 │ │ timeout: float | httpx.Timeout | None | NotGiven = NOT_GIVEN, │
│ 1998 │ ) -> ChatCompletion | AsyncStream[ChatCompletionChunk]: │
│ 1999 │ │ validate_response_format(response_format) │
│ ❱ 2000 │ │ return await self._post( │
│ 2001 │ │ │ "/chat/completions", │
│ 2002 │ │ │ body=await async_maybe_transform( │
│ 2003 │ │ │ │ { │
│ │
│ /usr/local/lib/python3.10/site-packages/openai/_base_client.py:1767 in post │
│ │
│ 1764 │ │ opts = FinalRequestOptions.construct( │
│ 1765 │ │ │ method="post", url=path, json_data=body, files=await │
│ async_to_httpx_files(files), **options │
│ 1766 │ │ ) │
│ ❱ 1767 │ │ return await self.request(cast_to, opts, stream=stream, stream_cls=stream_cls) │
│ 1768 │ │
│ 1769 │ async def patch( │
│ 1770 │ │ self, │
│ │
│ /usr/local/lib/python3.10/site-packages/openai/_base_client.py:1461 in request │
│ │
│ 1458 │ │ else: │
│ 1459 │ │ │ retries_taken = 0 │
│ 1460 │ │ │
│ ❱ 1461 │ │ return await self._request( │
│ 1462 │ │ │ cast_to=cast_to, │
│ 1463 │ │ │ options=options, │
│ 1464 │ │ │ stream=stream, │
│ │
│ /usr/local/lib/python3.10/site-packages/openai/_base_client.py:1562 in _request │
│ │
│ 1559 │ │ │ │ await err.response.aread() │
│ 1560 │ │ │ │
│ 1561 │ │ │ log.debug("Re-raising status error") │
│ ❱ 1562 │ │ │ raise self._make_status_error_from_response(err.response) from None │
│ 1563 │ │ │
│ 1564 │ │ return await self._process_response( │
│ 1565 │ │ │ cast_to=cast_to, │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
BadRequestError: Error code: 400 - {'object': 'error', 'message': '"auto" tool choice requires
--enable-auto-tool-choice and --tool-call-parser to be set', 'type': 'BadRequestError', 'param': None, 'code':
400}
19:56:55.252 [END] /v1/inference/chat-completion [StatusCode.OK] (563.14ms)
19:56:55.251 [ERROR] Error executing endpoint route='/v1/inference/chat-completion' method='post'
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/llama_stack/distribution/server/server.py", line 193, in endpoint
return await maybe_await(value)
File "/usr/local/lib/python3.10/site-packages/llama_stack/distribution/server/server.py", line 156, in maybe_await
return await value
File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/utils/telemetry/trace_protocol.py", line 102, in async_wrapper
result = await method(self, *args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/llama_stack/distribution/routers/routers.py", line 316, in chat_completion
response = await provider.chat_completion(**params)
File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/utils/telemetry/trace_protocol.py", line 102, in async_wrapper
result = await method(self, *args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/remote/inference/vllm/vllm.py", line 300, in chat_completion
return await self._nonstream_chat_completion(request, self.client)
File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/remote/inference/vllm/vllm.py", line 306, in _nonstream_chat_completion
r = await client.chat.completions.create(**params)
File "/usr/local/lib/python3.10/site-packages/openai/resources/chat/completions/completions.py", line 2000, in create
return await self._post(
File "/usr/local/lib/python3.10/site-packages/openai/_base_client.py", line 1767, in post
return await self.request(cast_to, opts, stream=stream, stream_cls=stream_cls)
File "/usr/local/lib/python3.10/site-packages/openai/_base_client.py", line 1461, in request
return await self._request(
File "/usr/local/lib/python3.10/site-packages/openai/_base_client.py", line 1562, in _request
raise self._make_status_error_from_response(err.response) from None
openai.BadRequestError: Error code: 400 - {'object': 'error', 'message': '"auto" tool choice requires --enable-auto-tool-choice and --tool-call-parser to be set', 'type': 'BadRequestError', 'param': None, 'code': 400}
...
Client side:
================================================================================================================= FAILURES ==================================================================================================================
_______________________________________________________________ test_text_chat_completion_with_tool_calling_and_non_streaming[txt=8B-inference:chat_completion:tool_calling] ________________________________________________________________
inference/test_text_inference.py:292: in test_text_chat_completion_with_tool_calling_and_non_streaming
response = client_with_models.inference.chat_completion(
../../../env/lib/python3.10/site-packages/llama_stack_client/_utils/_utils.py:275: in wrapper
return func(*args, **kwargs)
../../../env/lib/python3.10/site-packages/llama_stack_client/resources/inference.py:291: in chat_completion
return self._post(
../../../env/lib/python3.10/site-packages/llama_stack_client/_base_client.py:1225: in post
return cast(ResponseT, self.request(cast_to, opts, stream=stream, stream_cls=stream_cls))
../../../env/lib/python3.10/site-packages/llama_stack_client/_base_client.py:917: in request
return self._request(
../../../env/lib/python3.10/site-packages/llama_stack_client/_base_client.py:1005: in _request
return self._retry_request(
../../../env/lib/python3.10/site-packages/llama_stack_client/_base_client.py:1054: in _retry_request
return self._request(
../../../env/lib/python3.10/site-packages/llama_stack_client/_base_client.py:1005: in _request
return self._retry_request(
../../../env/lib/python3.10/site-packages/llama_stack_client/_base_client.py:1054: in _retry_request
return self._request(
../../../env/lib/python3.10/site-packages/llama_stack_client/_base_client.py:1020: in _request
raise self._make_status_error_from_response(err.response) from None
E llama_stack_client.InternalServerError: Error code: 500 - {'detail': 'Internal server error: An unexpected error occurred.'}
_________________________________________________________________ test_text_chat_completion_with_tool_calling_and_streaming[txt=8B-inference:chat_completion:tool_calling] __________________________________________________________________
inference/test_text_inference.py:336: in test_text_chat_completion_with_tool_calling_and_streaming
tool_invocation_content = extract_tool_invocation_content(response)
inference/test_text_inference.py:313: in extract_tool_invocation_content
delta = chunk.event.delta
E AttributeError: 'NoneType' object has no attribute 'delta'
____________________________________________________________________ test_text_chat_completion_with_tool_choice_required[txt=8B-inference:chat_completion:tool_calling] _____________________________________________________________________
inference/test_text_inference.py:360: in test_text_chat_completion_with_tool_choice_required
tool_invocation_content = extract_tool_invocation_content(response)
inference/test_text_inference.py:313: in extract_tool_invocation_content
delta = chunk.event.delta
E AttributeError: 'NoneType' object has no attribute 'delta'
______________________________________________________________________ test_text_chat_completion_structured_output[txt=8B-inference:chat_completion:structured_output] ______________________________________________________________________
inference/test_text_inference.py:414: in test_text_chat_completion_structured_output
answer = AnswerFormat.model_validate_json(response.completion_message.content)
E pydantic_core._pydantic_core.ValidationError: 1 validation error for AnswerFormat
E Invalid JSON: EOF while parsing an object at line 8191 column 0 [type=json_invalid, input_value='{ \n\n\n\n\n\n \n\n\...\n\n\n\n \n\n\n\n\n\n', input_type=str]
E For further information visit https://errors.pydantic.dev/2.10/v/json_invalid
_______________________________________________________ test_text_chat_completion_tool_calling_tools_not_in_request[txt=8B-inference:chat_completion:tool_calling_tools_absent-True] ________________________________________________________
inference/test_text_inference.py:450: in test_text_chat_completion_tool_calling_tools_not_in_request
delta = chunk.event.delta
E AttributeError: 'NoneType' object has no attribute 'delta'
_______________________________________________________ test_text_chat_completion_tool_calling_tools_not_in_request[txt=8B-inference:chat_completion:tool_calling_tools_absent-False] _______________________________________________________
inference/test_text_inference.py:446: in test_text_chat_completion_tool_calling_tools_not_in_request
response = client_with_models.inference.chat_completion(**request)
../../../env/lib/python3.10/site-packages/llama_stack_client/_utils/_utils.py:275: in wrapper
return func(*args, **kwargs)
../../../env/lib/python3.10/site-packages/llama_stack_client/resources/inference.py:291: in chat_completion
return self._post(
../../../env/lib/python3.10/site-packages/llama_stack_client/_base_client.py:1225: in post
return cast(ResponseT, self.request(cast_to, opts, stream=stream, stream_cls=stream_cls))
../../../env/lib/python3.10/site-packages/llama_stack_client/_base_client.py:917: in request
return self._request(
../../../env/lib/python3.10/site-packages/llama_stack_client/_base_client.py:1005: in _request
return self._retry_request(
../../../env/lib/python3.10/site-packages/llama_stack_client/_base_client.py:1054: in _retry_request
return self._request(
../../../env/lib/python3.10/site-packages/llama_stack_client/_base_client.py:1005: in _request
return self._retry_request(
../../../env/lib/python3.10/site-packages/llama_stack_client/_base_client.py:1054: in _retry_request
return self._request(
../../../env/lib/python3.10/site-packages/llama_stack_client/_base_client.py:1020: in _request
raise self._make_status_error_from_response(err.response) from None
E llama_stack_client.InternalServerError: Error code: 500 - {'detail': 'Internal server error: An unexpected error occurred.'}
========================================================================================================== short test summary info ==========================================================================================================
FAILED inference/test_text_inference.py::test_text_chat_completion_with_tool_calling_and_non_streaming[txt=8B-inference:chat_completion:tool_calling] - llama_stack_client.InternalServerError: Error code: 500 - {'detail': 'Internal server
error: An unexpected error occurred.'}
FAILED inference/test_text_inference.py::test_text_chat_completion_with_tool_calling_and_streaming[txt=8B-inference:chat_completion:tool_calling] - AttributeError: 'NoneType' object has no attribute 'delta'
FAILED inference/test_text_inference.py::test_text_chat_completion_with_tool_choice_required[txt=8B-inference:chat_completion:tool_calling] - AttributeError: 'NoneType' object has no attribute 'delta'
FAILED inference/test_text_inference.py::test_text_chat_completion_structured_output[txt=8B-inference:chat_completion:structured_output] - pydantic_core._pydantic_core.ValidationError: 1 validation error for AnswerFormat
FAILED inference/test_text_inference.py::test_text_chat_completion_tool_calling_tools_not_in_request[txt=8B-inference:chat_completion:tool_calling_tools_absent-True] - AttributeError: 'NoneType' object has no attribute 'delta'
FAILED inference/test_text_inference.py::test_text_chat_completion_tool_calling_tools_not_in_request[txt=8B-inference:chat_completion:tool_calling_tools_absent-False] - llama_stack_client.InternalServerError: Error code: 500 - {'detail':
'Internal server error: An unexpected error occurred.'}
It is probably related to how tool use is enabled for the vLLM Docker container.
Yes, you need to enable tool calling, as instructed by the error message.
Thanks. I could not find any documentation on enabling tool calling when running vLLM as a container. Do you have any instructions that might help?
@dawenxi-007 See https://docs.vllm.ai/en/latest/features/tool_calling.html. I am adding the link to the docs: https://github.com/meta-llama/llama-stack/pull/1719
Thanks, but the link does not give information about tool calling for deployments with Docker, which is what I am looking for. The current documentation from vLLM and from Llama Stack does not describe it either.
What's not clear? Maybe I am missing something here. You can add additional args (relevant to tool calling) to your container entrypoint.
Yes, it turned out I needed all three of the following options enabled (there are some dependencies between them); a combined launch command is sketched after the flags:
--enable-auto-tool-choice \
--tool-call-parser llama3_json \
--chat-template examples/tool_chat_template_llama3.1_json.jinja
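Putting those flags together with the earlier docker run command gives a launch roughly like the following. This is a sketch; it assumes the Llama 3.1 JSON tool-call chat template shipped with vLLM is available at that path inside (or mounted into) the container:

```bash
docker run -d --rm \
  --name llamastk_vllm \
  --runtime nvidia \
  --shm-size 1g \
  -p $INFERENCE_PORT:$INFERENCE_PORT \
  --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN" \
  --ipc=host vllm/vllm-openai:latest \
  --gpu-memory-utilization 0.9 \
  --model $INFERENCE_MODEL \
  --tensor-parallel-size 1 \
  --port 80 \
  --enable-auto-tool-choice \
  --tool-call-parser llama3_json \
  --chat-template examples/tool_chat_template_llama3.1_json.jinja
```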
Now, for text inference testing (I haven't verified vision yet), all other tests pass except for structured output.
______________________________ test_text_chat_completion_structured_output[txt=8B-inference:chat_completion:structured_output] ______________________________
inference/test_text_inference.py:414: in test_text_chat_completion_structured_output
answer = AnswerFormat.model_validate_json(response.completion_message.content)
E pydantic_core._pydantic_core.ValidationError: 1 validation error for AnswerFormat
E Invalid JSON: EOF while parsing an object at line 8191 column 0 [type=json_invalid, input_value='{ \n\n\n\n\n\n \n\n\...\n\n\n\n \n\n\n\n\n\n', i
nput_type=str]
E For further information visit https://errors.pydantic.dev/2.10/v/json_invalid
================================================================== short test summary info ==================================================================
FAILED inference/test_text_inference.py::test_text_chat_completion_structured_output[txt=8B-inference:chat_completion:structured_output] - pydantic_core._pydantic_core.ValidationError: 1 validation error for AnswerFormat
==================================================== 1 failed, 16 passed, 2 warnings in 69.94s (0:01:09) ====================================================
This issue has been automatically marked as stale because it has not had activity within 60 days. It will be automatically closed if no further activity occurs within 30 days.
This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant!
OK, I have a somewhat working branch here, https://github.com/leseb/llama-stack/pull/7, with some tests still failing. I think we could have a nightly job for vLLM only.
Some highlights:
- We run vLLM in CPU mode
- We can only load a 1B model
- Somehow the vLLM CPU container builds are not working, so I built my own image from a runner and baked the 1B model into it. Using that image works (see the build sketch after this list)
- It's flaky from time to time, but that seems to be due to runner errors
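For reference, a minimal sketch of building a CPU-only vLLM image from the vLLM sources; the Dockerfile location has moved between vLLM releases, and baking the 1B model in would be an extra layer on top of this:

```bash
# Sketch only: build a CPU-only vLLM image from the vLLM sources.
# The Dockerfile path below is an assumption and differs between releases.
git clone https://github.com/vllm-project/vllm.git
cd vllm
docker build -f docker/Dockerfile.cpu -t vllm-cpu-env --shm-size=4g .
```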
I think it would be best to narrow down the vLLM integration tests to the most meaningful ones.
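For example, the job could start with a narrowed selection via pytest's `-k` filter; the expression below is a hypothetical starting point based on the test names pasted earlier in this thread, reusing the same `$INFERENCE_MODEL` variable from the docker command above:

```bash
# Hypothetical narrowed selection: core chat-completion and tool-calling tests only.
pytest -s -v inference/test_text_inference.py \
  --stack-config http://localhost:5000 \
  --text-model "$INFERENCE_MODEL" \
  -k "chat_completion_non_streaming or chat_completion_streaming or tool_calling"
```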
@bbrowning @ashwinb @terrytangyuan thoughts on that?
@leseb I'd love to get regular test signals from vLLM (or other providers, but especially vLLM). I'd prefer to run the tests with a "real" model, which requires GPU. If the choice is between no vLLM testing or vLLM nightly testing with CPU, nightly with CPU is definitely better than nothing.
Perhaps a nightly with CPU is a decent stop-gap for now. I could get us set up with some nightly vLLM testing of the latest Llama Stack main in a separate GitHub org, where I have access to secrets, runners, and budget that can run real models in vLLM. I actually did this for some of the OpenAI API verification tests for quite a while - not just with vLLM but also with a number of our SaaS inference providers.
Regardless of the method (in-tree nightly on CPU vs external nightly with real GPUs), getting value from this will hinge on how we use the results. Nightly jobs aren't PR-gating, so how will we respond to and fix failures? Will there be some place we can check before cutting a release to know which providers are passing tests and which aren't?
I spoke with @derekhiggins and he is going to take a look at a nightly job, starting with CPU first and hopefully GPU one day. I believe maintainers will receive emails from GitHub if the nightly fails. We just need someone to take ownership of this CI job and fix it when it breaks. Perhaps @derekhiggins again? :)
Happy to take ownership of this one
I've added a PR to add vLLM to the current integration jobs; can you take a look when possible?