Responses RAG works with one model but not another
System Info
llama_stack==0.2.17
llama_stack_client==0.2.17
Information
- [ ] The official example scripts
- [x] My own modified scripts
🐛 Describe the bug
When I try to use `file_search` with the Llama 3.3 model in the Llama API, I get a `BadRequestError` 400 "Input should be a valid string"; see the full stack trace and server log error below. The same code works fine when I use gpt-4o as the model.
Here is the code:

```python
import requests
from pathlib import Path

# Download a sample PDF for demonstration
def download_sample_pdf(url: str, filename: str) -> str:
    """Download a PDF from URL and save it locally"""
    print(f"Downloading PDF from: {url}")
    response = requests.get(url)
    response.raise_for_status()
    filepath = Path(filename)
    with open(filepath, 'wb') as f:
        f.write(response.content)
    print(f"PDF saved as: {filepath}")
    return str(filepath)

pdf_url = "https://www.nps.gov/aboutus/upload/NPIndex2012-2016.pdf"
pdf_path = download_sample_pdf(pdf_url, "NPIndex2012-2016.pdf")
pdf_title = "The National Parks: Index 2012-2016"

import uuid

vector_store_name = f"vec_{str(uuid.uuid4())[0:8]}"
vector_store = client.vector_stores.create(name=vector_store_name)
vector_store_id = vector_store.id

file_create_response = client.files.create(file=Path(pdf_path), purpose="assistants")
file_ingest_response = client.vector_stores.files.create(
    vector_store_id=vector_store_id,
    file_id=file_create_response.id,
)

rag_llama_stack_client_response = client.responses.create(
    model=LLAMA_STACK_MODEL_IDS[2],
    input="When did the Bering Land Bridge become a national preserve?",
    tools=[
        {
            "type": "file_search",
            "vector_store_ids": [vector_store_id],
        }
    ]
)
```
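For context, the snippet above assumes `client` is an already-initialized Llama Stack client (and `LLAMA_STACK_MODEL_IDS` is a list of model identifiers defined elsewhere). A minimal setup might look like this; the base URL here is an assumption for a locally running server:

```python
from llama_stack_client import LlamaStackClient

# Assumed setup, not shown in the original snippet: point the client at a
# locally running Llama Stack server (default port 8321 assumed).
client = LlamaStackClient(base_url="http://localhost:8321")
```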
Note that if I don't do RAG and just call the model with no tools, then both models work fine:
```python
client.chat.completions.create(
    model=LLAMA_STACK_MODEL_ID,
    messages=[{"role": "user", "content": "What is the capital of France?"}]
)
```
Error logs
Client:
```
---------------------------------------------------------------------------
BadRequestError Traceback (most recent call last)
Cell In[17], line 1
----> 1 rag_llama_stack_client_response = client.responses.create(
2 model=LLAMA_STACK_MODEL_IDS[2],
3 input="When did the Bering Land Bridge become a national preserve?",
4 tools=[
5 {
6 "type": "file_search",
7 "vector_store_ids": [vector_store_id],
8 }
9 ]
10 )
12 rag_llama_stack_client_response
File ~/sample-agent/venv-3_12_9/lib/python3.12/site-packages/llama_stack_client/_utils/_utils.py:283, in required_args.<locals>.inner.<locals>.wrapper(*args, **kwargs)
281 msg = f"Missing required argument: {quote(missing[0])}"
282 raise TypeError(msg)
--> 283 return func(*args, **kwargs)
File ~/sample-agent/venv-3_12_9/lib/python3.12/site-packages/llama_stack_client/resources/responses/responses.py:212, in ResponsesResource.create(self, input, model, instructions, max_infer_iters, previous_response_id, store, stream, temperature, text, tools, extra_headers, extra_query, extra_body, timeout)
191 @required_args(["input", "model"], ["input", "model", "stream"])
192 def create(
193 self,
(...) 210 timeout: float | httpx.Timeout | None | NotGiven = NOT_GIVEN,
211 ) -> ResponseObject | Stream[ResponseObjectStream]:
--> 212 return self._post(
213 "/v1/openai/v1/responses",
214 body=maybe_transform(
215 {
216 "input": input,
217 "model": model,
218 "instructions": instructions,
219 "max_infer_iters": max_infer_iters,
220 "previous_response_id": previous_response_id,
221 "store": store,
222 "stream": stream,
223 "temperature": temperature,
224 "text": text,
225 "tools": tools,
226 },
227 response_create_params.ResponseCreateParamsStreaming
228 if stream
229 else response_create_params.ResponseCreateParamsNonStreaming,
230 ),
231 options=make_request_options(
232 extra_headers=extra_headers, extra_query=extra_query, extra_body=extra_body, timeout=timeout
233 ),
234 cast_to=ResponseObject,
235 stream=stream or False,
236 stream_cls=Stream[ResponseObjectStream],
237 )
File ~/sample-agent/venv-3_12_9/lib/python3.12/site-packages/llama_stack_client/_base_client.py:1232, in SyncAPIClient.post(self, path, cast_to, body, options, files, stream, stream_cls)
1218 def post(
1219 self,
1220 path: str,
(...) 1227 stream_cls: type[_StreamT] | None = None,
1228 ) -> ResponseT | _StreamT:
1229 opts = FinalRequestOptions.construct(
1230 method="post", url=path, json_data=body, files=to_httpx_files(files), **options
1231 )
-> 1232 return cast(ResponseT, self.request(cast_to, opts, stream=stream, stream_cls=stream_cls))
File ~/sample-agent/venv-3_12_9/lib/python3.12/site-packages/llama_stack_client/_base_client.py:1034, in SyncAPIClient.request(self, cast_to, options, stream, stream_cls)
1031 err.response.read()
1033 log.debug("Re-raising status error")
-> 1034 raise self._make_status_error_from_response(err.response) from None
1036 break
1038 assert response is not None, "could not resolve response (should never happen)"
BadRequestError: Error code: 400 - {'detail': {'errors': [{'loc': ['finish_reason'], 'msg': 'Input should be a valid string', 'type': 'string_type'}]}}
```
Server:
```
INFO 2025-08-15 15:31:06,857 console_span_processor:28 telemetry: 19:31:06.857 [START] /v1/openai/v1/responses
ERROR 2025-08-15 15:31:07,394 __main__:244 server: Error executing endpoint route='/v1/openai/v1/responses' method='post': 1 validation error for
OpenAIChoice
finish_reason
Input should be a valid string
For further information visit https://errors.pydantic.dev/2.11/v/string_type
INFO 2025-08-15 15:31:07,399 uvicorn.access:473 uncategorized: ::1:58064 - "POST /v1/openai/v1/responses HTTP/1.1" 400
INFO 2025-08-15 15:31:07,407 console_span_processor:42 telemetry: 19:31:07.404 [END] /v1/openai/v1/responses [StatusCode.OK] (546.56ms)
INFO 2025-08-15 15:31:07,409 console_span_processor:65 telemetry: 19:31:07.399 [ERROR] Error executing endpoint route='/v1/openai/v1/responses'
method='post': 1 validation error for OpenAIChoice
finish_reason
Input should be a valid string
For further information visit https://errors.pydantic.dev/2.11/v/string_type
INFO 2025-08-15 15:31:07,412 console_span_processor:65 telemetry: 19:31:07.401 [INFO] ::1:58064 - "POST /v1/openai/v1/responses HTTP/1.1" 400
```
Expected behavior
RAG shouldn't crash with either model; at most, one model might give a better answer than the other.
Thanks for the detailed repro. The 400 “input should be a valid string” usually comes from Llama-Stack’s request validator: that route expects input: str.
When you enable RAG you’re likely passing a non-string (e.g., a dict/list of messages, or bytes after concatenating chunks). Some models/routes tolerate messages, others are strict about input: str, which explains why one model works and the other fails.
Quick checks:

- Right before the call, log the type: `assert isinstance(final_prompt, str)`.
- If you're building messages, either:
  - Flatten to a single prompt string, e.g. `prompt = f"{question}\n\nContext:\n{retrieved_text}"` and then `client.responses.create(model="llama-3.3", input=prompt)` (see the sketch after this list), or
  - If the model/route supports chat format, switch to the chat schema that route expects (e.g., `messages=[{"role": "user", "content": prompt}]`) instead of `input`.
- Make sure no bytes sneak in from file reads/OCR (`.decode("utf-8")` if needed).
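A minimal sketch of the flattening approach, assuming `question` and `retrieved_text` are plain strings you already have and `client` is your existing client:

```python
# Flatten the question and the retrieved context into one plain string,
# since this route expects `input` to be a str.
question = "When did the Bering Land Bridge become a national preserve?"
retrieved_text = "...retrieved chunks joined into a single string..."

prompt = f"{question}\n\nContext:\n{retrieved_text}"
assert isinstance(prompt, str)

response = client.responses.create(model="llama-3.3", input=prompt)
```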
If you share the exact `responses.create(...)` payload for the failing case, I can point to the precise field that needs to be converted.
@onestardao thank you for looking into this! Based on the error message, I believe you are right that there is a non-string where a string should be, but whatever logic is producing it appears to be internal to Llama Stack. The Responses API, unlike most other APIs for getting answers from models, can do agentic reasoning with multiple turns. To be more clear and explicit about what's going on, I updated my sample code to the following:
```python
rag_query = "When did the Bering Land Bridge become a national preserve?"
rag_model = "llama-openai-compat/Llama-3.3-70B-Instruct"
assert isinstance(rag_query, str)
assert isinstance(rag_model, str)

rag_llama_stack_client_response = client.responses.create(
    model=rag_model,
    input=rag_query,
    tools=[
        {
            "type": "file_search",
            "vector_store_ids": [vector_store_id],
        }
    ]
)
```
I still get the same error:
```
BadRequestError: Error code: 400 - {'detail': {'errors': [{'loc': ['finish_reason'], 'msg': 'Input should be a valid string', 'type': 'string_type'}]}}
```
As I understand it, what's going on here is that this call causes Llama Stack to first prompt the model with the query and the list of tools (in this case, just the file search tool); then, if a tool is selected, execute it; then re-prompt the model with the outputs of that tool; and finally return the model's response to the user. If I had to guess, I would say that maybe something about the provider or the model for the first model call resulted in some sort of invalid state that led Llama Stack to produce a non-string for the second model call, but I could be wrong about that.
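Roughly, the flow I'm describing looks like the following simplified sketch; `call_model` and `execute_tool` are hypothetical stand-ins, not actual Llama Stack functions:

```python
# Simplified sketch of the agentic Responses flow described above;
# these helpers are hypothetical placeholders, not Llama Stack internals.
def call_model(model, messages, tools=None):
    ...  # provider chat-completion call

def execute_tool(tool_call):
    ...  # e.g., run file_search against the vector store

def run_responses(model, user_input, tools):
    # Turn 1: prompt the model with the query plus the available tools.
    first = call_model(model, [{"role": "user", "content": user_input}], tools)
    if not first["tool_calls"]:
        return first["content"]  # model answered directly
    # Execute the selected tool and re-prompt the model with its output.
    tool_output = execute_tool(first["tool_calls"][0])
    second = call_model(
        model,
        [
            {"role": "user", "content": user_input},
            {"role": "assistant", "tool_calls": first["tool_calls"]},
            {"role": "tool", "content": tool_output},
        ],
    )
    # My suspicion: if turn 1 leaves something non-string in this intermediate
    # state, turn 2 (or its response validation) fails with the 400 seen here.
    return second["content"]  # final answer returned to the caller
```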
For reference, when I call this same code except with `"openai/gpt-4o"` as the model, the `rag_llama_stack_client_response` has the following structure:

```
ID: resp-86c165e6-07cb-4cc4-9af2-38d8eb5bc4c8
Status: completed
Model: openai/gpt-3.5-turbo
Created at: 1755287351
Output items: 2

--- Output Item 1 ---
Output type: file_search_call
Tool Call ID: call_7n7wK9E6Lc7eMOI89p2EKKaV
Tool Status: completed
Queries: Bering Land Bridge National Preserve establishment date
Results: [{'file_id': '', 'filename': '', 'text': ' for abundant wildlife and sport\n \nfishing for five species of salmon.\n \nEstablished Dec. 2, 1980. <THERE WAS A LOT MORE TEXT HERE>', 'score': 2.447602654616083}]

--- Output Item 2 ---
Output type: message
Response content: The Bering Land Bridge National Preserve was established on December 2, 1980.
```
Here you can see the multi-step nature of the RAG process for this API.
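For reference, the summary above can be printed with something along these lines; the attribute names are assumptions based on the OpenAI-style response object that llama_stack_client returns, so treat this as a sketch rather than exact code:

```python
# Rough sketch of how such a summary could be produced; attribute names are
# assumed from the OpenAI-style Responses object, not verified against the SDK.
resp = rag_llama_stack_client_response
print(f"ID: {resp.id}")
print(f"Status: {resp.status}")
print(f"Model: {resp.model}")
print(f"Created at: {resp.created_at}")
print(f"Output items: {len(resp.output)}")

for i, item in enumerate(resp.output, start=1):
    print(f"--- Output Item {i} ---")
    print(f"Output type: {item.type}")
    if item.type == "file_search_call":
        print(f"Queries: {item.queries}")
        print(f"Results: {item.results}")
    elif item.type == "message":
        # message content is a list of text parts in the OpenAI-style schema
        print(f"Response content: {item.content[0].text}")
```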
This issue has been automatically marked as stale because it has not had activity within 60 days. It will be automatically closed if no further activity occurs within 30 days.
Is this still an issue @jwm4?
I'm not sure. I can try to replicate it using the latest Llama Stack version at some point if that would be helpful.
Yes please! But at your leisure, this isn't anything urgent, just seeing what bugs we can possibly close out
The bug is at `streaming.py:573` where `chunk_finish_reason = ""` is initialized. When streaming chunks don't provide a `finish_reason` (which may be the case with Llama providers), this empty string fails OpenAI SDK validation, since `finish_reason` must be a `Literal["stop", "length", "tool_calls", "content_filter", "function_call"]`.
However, I was not able to reproduce it with Ollama and the llama3.2:1b model, as Ollama correctly returns `finish_reason='stop'` in the final chunk.
I think we should initialize `chunk_finish_reason` to `'stop'` instead of an empty string. @jwm4, can you check again against the Llama API whether this still happens?
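To illustrate the constraint, here is a standalone sketch (a stand-in model, not the actual OpenAIChoice class from llama-stack): an empty string is rejected by the `Literal` validator, while `'stop'` passes.

```python
from typing import Literal
from pydantic import BaseModel, ValidationError

# Minimal stand-in for a choice model whose finish_reason is restricted to the
# allowed literals; not the real OpenAIChoice class.
class Choice(BaseModel):
    finish_reason: Literal["stop", "length", "tool_calls", "content_filter", "function_call"]

try:
    Choice(finish_reason="")   # what happens when no finish_reason arrives in any chunk
except ValidationError as err:
    print(err)                 # rejected: "" is not one of the allowed literals

print(Choice(finish_reason="stop"))  # the proposed default validates fine
```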