
Responses RAG works with one model but not another

Open jwm4 opened this issue 5 months ago • 7 comments

System Info

llama_stack==0.2.17
llama_stack_client==0.2.17

Information

  • [ ] The official example scripts
  • [x] My own modified scripts

🐛 Describe the bug

When I try to use file_search with the Llama 3.3 model in the Llama API, I get a BadRequestError 400 ("Input should be a valid string"); see the full stack trace and server log error below. The same code works fine when I use gpt-4o as the model.

Here is the code:

import requests
from pathlib import Path

# Download a sample PDF for demonstration
def download_sample_pdf(url: str, filename: str) -> str:
    """Download a PDF from URL and save it locally"""
    print(f"Downloading PDF from: {url}")
    response = requests.get(url)
    response.raise_for_status()
    
    filepath = Path(filename)
    with open(filepath, 'wb') as f:
        f.write(response.content)
    
    print(f"PDF saved as: {filepath}")
    return str(filepath)

pdf_url = "https://www.nps.gov/aboutus/upload/NPIndex2012-2016.pdf"
pdf_path = download_sample_pdf(pdf_url, "NPIndex2012-2016.pdf")
pdf_title = "The National Parks: Index 2012-2016"

import uuid

vector_store_name = f"vec_{str(uuid.uuid4())[0:8]}"

vector_store = client.vector_stores.create(name=vector_store_name)
vector_store_id = vector_store.id

file_create_response = client.files.create(file=Path(pdf_path), purpose="assistants")
file_ingest_response = client.vector_stores.files.create(
    vector_store_id=vector_store_id,
    file_id=file_create_response.id,
)
rag_llama_stack_client_response = client.responses.create(
    model=LLAMA_STACK_MODEL_IDS[2],
    input="When did the Bering Land Bridge become a national preserve?",
    tools=[
        {
            "type": "file_search",
            "vector_store_ids": [vector_store_id],
        }
    ]
)
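
For completeness, the snippet assumes a client and model list along these lines (a sketch; neither appears in the report, and the base URL and exact model IDs are assumptions):

from llama_stack_client import LlamaStackClient

# Assumed setup: a Llama Stack server running locally on the default port.
client = LlamaStackClient(base_url="http://localhost:8321")

# Assumed model IDs; per the follow-up comment, index 2 is the Llama 3.3 model served via the Llama API.
LLAMA_STACK_MODEL_IDS = [
    "openai/gpt-4o",
    "openai/gpt-3.5-turbo",
    "llama-openai-compat/Llama-3.3-70B-Instruct",
]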

Note that if I don't do RAG and just call responses with no tools, then both models work fine:

client.chat.completions.create(
    model=LLAMA_STACK_MODEL_ID,
    messages=[{"role": "user", "content": "What is the capital of France?"}]
)
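
The same no-tools sanity check through the Responses route itself (which is what the report describes) would look like this sketch, with LLAMA_STACK_MODEL_ID being whichever single model ID is under test:

# Sketch: the equivalent no-tools call via the Responses API.
client.responses.create(
    model=LLAMA_STACK_MODEL_ID,
    input="What is the capital of France?",
)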

Error logs

Client:

---------------------------------------------------------------------------
BadRequestError                           Traceback (most recent call last)
Cell In[17], line 1
----> 1 rag_llama_stack_client_response = client.responses.create(
      2     model=LLAMA_STACK_MODEL_IDS[2],
      3     input="When did the Bering Land Bridge become a national preserve?",
      4     tools=[
      5         {
      6             "type": "file_search",
      7             "vector_store_ids": [vector_store_id],
      8         }
      9     ]
     10 )
     12 rag_llama_stack_client_response

File ~/sample-agent/venv-3_12_9/lib/python3.12/site-packages/llama_stack_client/_utils/_utils.py:283, in required_args.<locals>.inner.<locals>.wrapper(*args, **kwargs)
    281             msg = f"Missing required argument: {quote(missing[0])}"
    282     raise TypeError(msg)
--> 283 return func(*args, **kwargs)

File ~/sample-agent/venv-3_12_9/lib/python3.12/site-packages/llama_stack_client/resources/responses/responses.py:212, in ResponsesResource.create(self, input, model, instructions, max_infer_iters, previous_response_id, store, stream, temperature, text, tools, extra_headers, extra_query, extra_body, timeout)
    191 @required_args(["input", "model"], ["input", "model", "stream"])
    192 def create(
    193     self,
   (...)    210     timeout: float | httpx.Timeout | None | NotGiven = NOT_GIVEN,
    211 ) -> ResponseObject | Stream[ResponseObjectStream]:
--> 212     return self._post(
    213         "/v1/openai/v1/responses",
    214         body=maybe_transform(
    215             {
    216                 "input": input,
    217                 "model": model,
    218                 "instructions": instructions,
    219                 "max_infer_iters": max_infer_iters,
    220                 "previous_response_id": previous_response_id,
    221                 "store": store,
    222                 "stream": stream,
    223                 "temperature": temperature,
    224                 "text": text,
    225                 "tools": tools,
    226             },
    227             response_create_params.ResponseCreateParamsStreaming
    228             if stream
    229             else response_create_params.ResponseCreateParamsNonStreaming,
    230         ),
    231         options=make_request_options(
    232             extra_headers=extra_headers, extra_query=extra_query, extra_body=extra_body, timeout=timeout
    233         ),
    234         cast_to=ResponseObject,
    235         stream=stream or False,
    236         stream_cls=Stream[ResponseObjectStream],
    237     )

File ~/sample-agent/venv-3_12_9/lib/python3.12/site-packages/llama_stack_client/_base_client.py:1232, in SyncAPIClient.post(self, path, cast_to, body, options, files, stream, stream_cls)
   1218 def post(
   1219     self,
   1220     path: str,
   (...)   1227     stream_cls: type[_StreamT] | None = None,
   1228 ) -> ResponseT | _StreamT:
   1229     opts = FinalRequestOptions.construct(
   1230         method="post", url=path, json_data=body, files=to_httpx_files(files), **options
   1231     )
-> 1232     return cast(ResponseT, self.request(cast_to, opts, stream=stream, stream_cls=stream_cls))

File ~/sample-agent/venv-3_12_9/lib/python3.12/site-packages/llama_stack_client/_base_client.py:1034, in SyncAPIClient.request(self, cast_to, options, stream, stream_cls)
   1031             err.response.read()
   1033         log.debug("Re-raising status error")
-> 1034         raise self._make_status_error_from_response(err.response) from None
   1036     break
   1038 assert response is not None, "could not resolve response (should never happen)"

BadRequestError: Error code: 400 - {'detail': {'errors': [{'loc': ['finish_reason'], 'msg': 'Input should be a valid string', 'type': 'string_type'}]}}

Server:

INFO     2025-08-15 15:31:06,857 console_span_processor:28 telemetry: 19:31:06.857 [START] /v1/openai/v1/responses
ERROR    2025-08-15 15:31:07,394 __main__:244 server: Error executing endpoint route='/v1/openai/v1/responses' method='post': 1 validation error for
         OpenAIChoice
         finish_reason
           Input should be a valid string
             For further information visit https://errors.pydantic.dev/2.11/v/string_type
INFO     2025-08-15 15:31:07,399 uvicorn.access:473 uncategorized: ::1:58064 - "POST /v1/openai/v1/responses HTTP/1.1" 400
INFO     2025-08-15 15:31:07,407 console_span_processor:42 telemetry: 19:31:07.404 [END] /v1/openai/v1/responses [StatusCode.OK] (546.56ms)
INFO     2025-08-15 15:31:07,409 console_span_processor:65 telemetry:  19:31:07.399 [ERROR] Error executing endpoint route='/v1/openai/v1/responses'
         method='post': 1 validation error for OpenAIChoice
         finish_reason
           Input should be a valid string
             For further information visit https://errors.pydantic.dev/2.11/v/string_type
INFO     2025-08-15 15:31:07,412 console_span_processor:65 telemetry:  19:31:07.401 [INFO] ::1:58064 - "POST /v1/openai/v1/responses HTTP/1.1" 400

Expected behavior

RAG shouldn't crash with either model. Maybe you get a better answer from one or the other.

jwm4 commented on Aug 15, 2025

Thanks for the detailed repro. The 400 “input should be a valid string” usually comes from Llama-Stack’s request validator: that route expects input: str.

When you enable RAG you’re likely passing a non-string (e.g., a dict/list of messages, or bytes after concatenating chunks). Some models/routes tolerate messages, others are strict about input: str, which explains why one model works and the other fails.

Quick checks

  1. Right before the call, log the type: assert isinstance(final_prompt, str).

  2. If you’re building messages, either:

    • Flatten to a single prompt string, e.g.

      prompt = f"{question}\n\nContext:\n{retrieved_text}"
      client.responses.create(model="llama-3.3", input=prompt)
      
    • Or, if the model/route supports chat format, switch to the chat schema that route expects (e.g., messages=[{"role":"user","content": prompt}]) instead of input.

  3. Make sure no bytes sneak in from file reads/OCR (call .decode("utf-8") if needed); see the combined sketch after this list.
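
Combining those three checks into one helper (a sketch; the function name, the list-of-chunks input, and the model ID are placeholders, not anything from the original report):

def build_prompt(question: str, retrieved_chunks: list) -> str:
    """Flatten retrieved chunks into the single string the route expects."""
    parts = []
    for chunk in retrieved_chunks:
        # Guard against bytes sneaking in from file reads/OCR.
        parts.append(chunk.decode("utf-8") if isinstance(chunk, bytes) else str(chunk))
    prompt = f"{question}\n\nContext:\n" + "\n\n".join(parts)
    assert isinstance(prompt, str)  # the route expects input: str
    return prompt

# Usage sketch:
# client.responses.create(model="llama-3.3", input=build_prompt(question, chunks))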

If you share the exact responses.create(...) payload for the failing case, I can point to the precise field that needs to be converted.

onestardao commented on Aug 16, 2025

@onestardao thank you for looking into this! Based on the error message, I believe you are right that there is a non-string where a string should be, but I believe whatever logic is doing that is internal to Llama Stack. The Responses API, unlike most other APIs for getting answers from models, can do agentic reasoning with multiple turns. To be clearer and more explicit about what's going on, I updated my sample code to the following:

rag_query = "When did the Bering Land Bridge become a national preserve?"
rag_model =  "llama-openai-compat/Llama-3.3-70B-Instruct"

assert isinstance(rag_query, str)
assert isinstance(rag_model, str)

rag_llama_stack_client_response = client.responses.create(
    model=rag_model,
    input=rag_query,
    tools=[
        {
            "type": "file_search",
            "vector_store_ids": [vector_store_id],
        }
    ]
)

I still get the same error:

BadRequestError: Error code: 400 - {'detail': {'errors': [{'loc': ['finish_reason'], 'msg': 'Input should be a valid string', 'type': 'string_type'}]}}

As I understand it, this call causes Llama Stack to first prompt the model with the query and the list of tools (in this case, just the file search tool), then execute whichever tool is selected, then re-prompt the model with the outputs of that tool, and finally send the resulting response back to the user. If I had to guess, I would say that something about the provider or the model in the first model call put Llama Stack into an invalid state that led it to produce a non-string for the second model call, but I could be wrong about that.
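
A rough sketch of that flow (pseudocode only; call_model and run_file_search are injected placeholders, not Llama Stack internals):

# Pseudocode-style sketch of the multi-turn Responses flow described above.
def responses_with_file_search(call_model, run_file_search, model, user_input, tools):
    # Turn 1: prompt the model with the query plus the available tools.
    first_turn = call_model(model, user_input, tools=tools)

    if first_turn.get("tool_call"):
        # Execute the selected tool (here, file_search against the vector store).
        tool_output = run_file_search(first_turn["tool_call"])

        # Turn 2: re-prompt the model with the tool output appended; per the guess
        # above, the non-string value may be produced while building this turn.
        return call_model(model, user_input, tool_results=[tool_output])

    return first_turn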

For reference, when I call this same code but with "openai/gpt-4o" as the model, the rag_llama_stack_client_response has the following structure:

ID: resp-86c165e6-07cb-4cc4-9af2-38d8eb5bc4c8
Status: completed
Model: openai/gpt-3.5-turbo
Created at: 1755287351
Output items: 2

--- Output Item 1 ---
Output type: file_search_call
  Tool Call ID: call_7n7wK9E6Lc7eMOI89p2EKKaV
  Tool Status: completed
  Queries: Bering Land Bridge National Preserve establishment date
  Results: [{'file_id': '', 'filename': '', 'text': ' for abundant wildlife and sport\n \nfishing for five species of salmon.\n \nEstablished Dec. 2, 1980. <THERE WAS A LOT MORE TEXT HERE>', 'score': 2.447602654616083}]

--- Output Item 2 ---
Output type: message
Response content: The Bering Land Bridge National Preserve was established on December 2, 1980.

Here you can see the multi-step nature of the RAG process for this API.

jwm4 commented on Aug 16, 2025

This issue has been automatically marked as stale because it has not had activity within 60 days. It will be automatically closed if no further activity occurs within 30 days.

github-actions[bot] commented on Oct 16, 2025

Is this still an issue @jwm4?

nathan-weinberg commented on Nov 7, 2025

I'm not sure. I can try to replicate it using the latest Llama Stack version at some point if that would be helpful.

jwm4 commented on Nov 7, 2025

Yes please! But at your leisure, this isn't anything urgent, just seeing what bugs we can possibly close out

nathan-weinberg commented on Nov 7, 2025

The bug is at streaming.py:573, where chunk_finish_reason = "" is initialized. When streaming chunks don't provide a finish_reason (which might be related to the Llama providers), this empty string fails OpenAI SDK validation, since finish_reason must be a Literal["stop", "length", "tool_calls", "content_filter", "function_call"].

However, I was not able to reproduce it with Ollama and the llama3.2:1b model, as Ollama correctly returns finish_reason='stop' in the final chunk.

I think we should initialize chunk_finish_reason to 'stop' instead of an empty string. @jwm4, can you check again against the Llama API whether this still happens?
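
A minimal sketch of the proposed change (the surrounding loop is an assumption about the shape of the code, not the actual streaming.py contents):

def accumulate_finish_reason(provider_stream) -> str:
    # Sketch only; names and structure are assumed, not copied from streaming.py.
    chunk_finish_reason = "stop"  # proposed default instead of ""
    for chunk in provider_stream:
        choice = chunk.choices[0]
        if choice.finish_reason:
            # Only overwrite when the provider actually sends a value, so providers
            # that omit finish_reason in the final chunk still yield a valid string.
            chunk_finish_reason = choice.finish_reason
    return chunk_finish_reason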

r-bit-rry commented on Nov 26, 2025