What does this PR do?

Actualize query rewrite in search API, add default_query_expansion_model and query_expansion_prompt in VectorStoresConfig.

Makes rewrite_query parameter functional in vector store search.

rewrite_query=false (default): Use original query
rewrite_query=true: Expand query via LLM, or fail gracefully if no LLM available

Adds 4 parameters toVectorStoresConfig:

default_query_expansion_model: LLM model for query expansion (optional)
query_expansion_prompt: Custom prompt template (optional, uses built-in default)
query_expansion_max_tokens: Configurable token limit (default: 100)
query_expansion_temperature: Configurable temperature (default: 0.3)

Enabled run.yaml:

  vector_stores:
    rewrite_query_params:
      model:
        provider_id: "meta-reference"
        model_id: "llama3_1_8b_instruct"
      # prompt defaults to built-in
      # max_tokens defaults to 100
      # temperature defaults to 0.3

Fully customized run.yaml:

  vector_stores:
    rewrite_query_params:
      model:
        provider_id: "meta-reference"
        model_id: "llama3_1_8b_instruct"
      prompt: "Improve this search query: {query}"
      max_tokens: 150
      temperature: 0.2

Test Plan

Added test and recording

Example script as well:

import asyncio
from llama_stack_client import LlamaStackClient
from io import BytesIO

def gen_file(client, text: str=""):
    file_buffer = BytesIO(text.encode('utf-8'))
    file_buffer.name = "my_file.txt"

    uploaded_file = client.files.create(
        file=file_buffer,
        purpose="assistants"
    )
    return uploaded_file

async def test_query_rewriting():
    client = LlamaStackClient(base_url="http://0.0.0.0:8321/")
    uploaded_file = gen_file(client, "banana banana apple")
    uploaded_file2 = gen_file(client, "orange orange kiwi")

    vs = client.vector_stores.create()
    xf_vs = client.vector_stores.files.create(vector_store_id=vs.id, file_id=uploaded_file.id)
    xf_vs1 = client.vector_stores.files.create(vector_store_id=vs.id, file_id=uploaded_file2.id)
    response1 = client.vector_stores.search(
                vector_store_id=vs.id,
                query="apple",
                max_num_results=3,
                rewrite_query=False
            )
    response2 = client.vector_stores.search(
                vector_store_id=vs.id,
                query="kiwi",
                max_num_results=3,
                rewrite_query=True,
            )

    print(f"\n🔵 Response 1 (rewrite_query=False):\n\033[94m{response1}\033[0m")
    print(f"\n🟢 Response 2 (rewrite_query=True):\n\033[92m{response2}\033[0m")

    for f in [uploaded_file.id, uploaded_file2.id]:
        client.files.delete(file_id=f)
    client.vector_stores.delete(vector_store_id=vs.id)

if __name__ == "__main__":
    asyncio.run(test_query_rewriting())

And see the screen shot of the server logs showing it worked. Screenshot 2025-11-19 at 1 16 03 PM

Notice the log:

 Query rewritten:
         'kiwi' → 'kiwi, a small brown or green fruit native to New Zealand, or a person having a fuzzy brown outer skin similar in appearance.'

So kiwi was expanded.

Nov 17 '25 04:11 franciscojavierarceo

@franciscojavierarceo fyi, the example has apple as the first query, the log shows kiwi twice

Nov 18 '25 13:11 mattf

what about having the vector store config specify the rewrite model and the request is rejected if none is configured?

Yeah, that's actually what I ended up adding, sorry requested reviews a bit premature I'll push that update soon.

Nov 18 '25 17:11 franciscojavierarceo

@franciscojavierarceo please update the description with the new proposed config and user interaction

Nov 19 '25 13:11 mattf

This pull request has merge conflicts that must be resolved before it can be merged. @franciscojavierarceo please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Nov 19 '25 15:11 mergify[bot]

the spec doesn't mention letting api callers override the vector store config, but the implementation appears to let them. is this intended?

@mattf this was added at the request of @raghotham in discord. I can update the PR description to reflect this for sure. I actually would have preferred to do so as a follow up to reduce scope.

the model validation logic should be left to the inference api. if the model is of the wrong type the inference api should handle the error generation.

Sounds good, will revise.

vector store model type validation should happen during startup, if at all. right now the admin sees a successful startup and the user sees 400 errors about a misconfiguration.

Will revise. 👍

Nov 20 '25 15:11 franciscojavierarceo

the spec doesn't mention letting api callers override the vector store config, but the implementation appears to let them. is this intended?

@mattf this was added at the request of @raghotham in discord. I can update the PR description to reflect this for sure. I actually would have preferred to do so as a follow up to reduce scope.

@raghotham imho we have two use cases here:

(a) production - the vector store config gets set at creation time, which should include locking in the embedding model + rewrite model + rewrite prompt. having the rewrite model & prompt flexibility allows for tailoring vector stores to an app (real value here).
(b) research & development - the best config for the vector store is not known / tuned yet. in this case, it's wasteful to recreate (re-embed) the vector store for each iteration to tune the rewrite model & prompt. therefore, passing a new model & prompt for each query is important.

if we need both (a) and (b) in this pr, i suggest not taking extra args on the query call, but using the existing vector store update path.

Nov 20 '25 15:11 mattf

Ack @mattf - +1 with the startup validation approach.

@franciscojavierarceo FYI I created #4184 to tackle startup validation for optional dependencies so admins catch misconfigurations at deploy time rather than users hitting 400 errors and needing to debug after. Looks like the validation changes here would align with that pattern.

On the production vs. R&D discussion: Matt's suggestion to use the "update vector store" path seems like a good middle ground - gives R&D the flexibility to iterate without per-query instability. Happy to discuss additional validation patterns in #4184 if it helps.

Nov 20 '25 21:11 anik120

if rewrite is requested but the feature isn't configured, the user should get a 400 instead of silently ignoring

Will do.

the globals are risky and will lead to bugs. there may already be one for two vector stores configured w/ different prompts. can we skip the globals and use something like self.vector_store.vector_stores_config in query_chunks instead?

@mattf if we want to use self.vector_store.vector_stores_config we'll have to modify VectorStoreWithIndex which will result in requiring us to modify all of the adapters. The globals are the alternative I thought made sense, let me know if you have alternative suggestions.

Nov 25 '25 19:11 franciscojavierarceo

@mattf I updated the code to handle the case where rewrite is requested but not configured.

Mind if I split the Adapter changes in a subsequent PR?

Nov 26 '25 04:11 franciscojavierarceo

if rewrite is requested but the feature isn't configured, the user should get a 400 instead of silently ignoring

Will do.

the globals are risky and will lead to bugs. there may already be one for two vector stores configured w/ different prompts. can we skip the globals and use something like self.vector_store.vector_stores_config in query_chunks instead?

@mattf if we want to use self.vector_store.vector_stores_config we'll have to modify VectorStoreWithIndex which will result in requiring us to modify all of the adapters. The globals are the alternative I thought made sense, let me know if you have alternative suggestions.

my mistake, you're right that query_chunks is adapter specific. i was trying to suggest lifting the handling out of the individual adapters entirely. i don't see why adapters need to even know that the user is requesting query rewriting. the adapter just accepts a string to query against. worst case, the rewrite could happen in the mixin?

btw, i noticed that the routing layer only ever passes down a query: str, but the adapters accept a query: InterleavedContent.

Nov 26 '25 11:11 mattf

@mattf mind taking a look?

Dec 05 '25 11:12 franciscojavierarceo

@mattf updated again, thanks for the feedback!

Dec 06 '25 02:12 franciscojavierarceo

This pull request has merge conflicts that must be resolved before it can be merged. @franciscojavierarceo please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Dec 06 '25 02:12 mergify[bot]

@franciscojavierarceo i pushed two commits, one moves the rewrite prompt validation to stack startup and the other moves the rewrite functionality into the router (making it available to all providers)

@mattf I had one small comment but I'm good with this. LMK if there's anything else you'd like to see.

Dec 06 '25 20:12 franciscojavierarceo

feat: Add support for query rewrite in vector_store.search

What does this PR do?

Test Plan