[RFC] Support multi modal retrieval on top of llama stack, inference provider side
🚀 Describe the new functionality needed
We want to look into multimodal retrieval support for llama stack. This proposal first discusses what the inference provider API side should look like. In this proposal:
- Two endpoints: `/v1/embeddings/text` and `/v1/embeddings/image`.
- Both accept a `model` parameter. Using a shared multimodal model (like `"siglip-1"`) ensures embeddings are aligned.
- The text endpoint accepts plain text input; the image endpoint accepts only base64-encoded images.
- The response format is consistent between text and image embeddings, simplifying integration.
- This approach sets a foundation for multimodal retrieval and other advanced use cases involving both text and image data.
Endpoints
1. Text Embeddings Endpoint
URL: POST /v1/embeddings/text
Headers:
- `Content-Type: application/json`
- `Authorization: Bearer <API_KEY>`
Request Body:
- model (string, required): The name of the model to use. For multimodal capability, this should be set to something like `"siglip-1"`.
- input (string, required): The text string to embed.
- options (object, optional):
  - normalize (boolean, default: true): Whether to normalize the embedding vector.
  - return_dims (boolean, default: false): Whether to return the dimensionality of the embedding.
Example Request:
```json
{
  "model": "siglip-1",
  "input": "A photo of a white cat sitting on a chair.",
  "options": {
    "normalize": true,
    "return_dims": false
  }
}
```
Example Response:
```json
{
  "model": "siglip-1",
  "embedding": [0.0123, -0.0456, 0.0789, ...],
  "usage": {
    "embedding_compute_time_ms": 20
  }
}
```
If return_dims = true:
```json
{
  "model": "siglip-1",
  "embedding": [0.0123, -0.0456, 0.0789, ...],
  "embedding_dimensions": 1024,
  "usage": {
    "embedding_compute_time_ms": 20
  }
}
```
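To make the request/response shape concrete, here is a minimal Python sketch of a client call. The base URL, API key, and `embed_text` helper are illustrative assumptions, not part of the proposal:

```python
import requests

BASE_URL = "http://localhost:5000"  # assumed address of a llama-stack server
API_KEY = "sk-..."                  # placeholder API key

def embed_text(text: str, model: str = "siglip-1") -> list[float]:
    """POST a text string to the proposed /v1/embeddings/text endpoint."""
    resp = requests.post(
        f"{BASE_URL}/v1/embeddings/text",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
        json={
            "model": model,
            "input": text,
            "options": {"normalize": True, "return_dims": False},
        },
    )
    resp.raise_for_status()
    return resp.json()["embedding"]

text_embedding = embed_text("A photo of a white cat sitting on a chair.")
```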
2. Image Embeddings Endpoint
URL: POST /v1/embeddings/image
Headers:
- `Content-Type: application/json`
- `Authorization: Bearer <API_KEY>`
Request Body:
- model (string, required): The name of the model to use. For image embeddings aligned with the text embeddings above, use `"siglip-1"`.
- image (object, required):
  - base64 (string, required): A base64-encoded representation of the image (e.g., PNG or JPEG). The client must pre-encode the image before sending.
- options (object, optional):
  - normalize (boolean, default: true): Whether to normalize the embedding vector.
  - return_dims (boolean, default: false): Whether to return the dimensionality of the embedding.
Example Request:
```json
{
  "model": "siglip-1",
  "image": {
    "base64": "iVBORw0KGgoAAAANSUhEUgAAA... (rest of base64 encoded image)"
  },
  "options": {
    "normalize": true,
    "return_dims": false
  }
}
```
Example Response:
```json
{
  "model": "siglip-1",
  "embedding": [0.023, -0.081, 0.572, ...],
  "usage": {
    "embedding_compute_time_ms": 45
  }
}
```
If return_dims = true:
```json
{
  "model": "siglip-1",
  "embedding": [0.023, -0.081, 0.572, ...],
  "embedding_dimensions": 1024,
  "usage": {
    "embedding_compute_time_ms": 45
  }
}
```
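As with the text endpoint, here is a minimal client sketch; the `embed_image` helper, base URL, and file path are assumptions for illustration. Note that the base64 encoding happens client-side, matching the `image.base64` field above:

```python
import base64
import requests

BASE_URL = "http://localhost:5000"  # assumed address of a llama-stack server
API_KEY = "sk-..."                  # placeholder API key

def embed_image(path: str, model: str = "siglip-1") -> list[float]:
    """Base64-encode an image file, then POST it to the proposed /v1/embeddings/image endpoint."""
    with open(path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")
    resp = requests.post(
        f"{BASE_URL}/v1/embeddings/image",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
        json={
            "model": model,
            "image": {"base64": image_b64},
            "options": {"normalize": True, "return_dims": False},
        },
    )
    resp.raise_for_status()
    return resp.json()["embedding"]

image_embedding = embed_image("white_cat_on_chair.png")  # hypothetical candidate image
```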
Example Workflow for Multimodal Retrieval
Scenario: A user wants to find images similar to a textual concept ("A white cat on a chair").
1. Get the Text Embedding
   Call the text embedding endpoint with your textual query:

   ```
   POST /v1/embeddings/text
   { "model": "siglip-1", "input": "A photo of a white cat sitting on a chair." }
   ```

   Assume the response:

   ```json
   { "model": "siglip-1", "embedding": [0.0123, -0.0456, 0.0789, ...] }
   ```

   Store this embedding in your application as `text_embedding`.

2. Get the Image Embedding for a Candidate Image
   Convert your candidate image to base64 (done client-side), then:

   ```
   POST /v1/embeddings/image
   { "model": "siglip-1", "image": { "base64": "iVBORw0K..." } }
   ```

   Assume the response:

   ```json
   { "model": "siglip-1", "embedding": [0.023, -0.081, 0.572, ...] }
   ```

   Store this embedding in your application as `image_embedding`.

3. Compare the Embeddings
   Because both embeddings come from the same aligned model (`"siglip-1"`), you can compute cosine similarity or another metric to see how "close" the text concept is to the image content (see the sketch after this list):

   ```
   similarity = cosine_similarity(text_embedding, image_embedding)
   ```

   If `similarity` is high, this image is likely a good match for the textual query.
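Step 3 reduces to a few lines of numpy. This sketch reuses the `text_embedding` and `image_embedding` variables from the client sketches above, which are assumptions of those sketches:

```python
import numpy as np

def cosine_similarity(a, b) -> float:
    """Cosine similarity between two embedding vectors."""
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# text_embedding / image_embedding come from the earlier client sketches.
# With normalize=true, both vectors are unit length, so this is
# effectively a plain dot product.
similarity = cosine_similarity(text_embedding, image_embedding)
```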
Error Handling
- 400 Bad Request: Missing `model` or `input`/`image` field, or invalid base64 encoding.
- 401 Unauthorized: Invalid or missing API key.
- 415 Unsupported Media Type: If the `Content-Type` is not `application/json`.
- 500 Internal Server Error: Unexpected server issues.
Example Error Response:
```json
{
  "error": {
    "message": "Invalid base64 image encoding",
    "type": "invalid_request_error"
  }
}
```
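A caller might surface these errors as follows; a minimal sketch that assumes the error body shape above and reuses the placeholder `BASE_URL`/`API_KEY` from the earlier sketches:

```python
import requests

BASE_URL = "http://localhost:5000"  # assumed address of a llama-stack server
API_KEY = "sk-..."                  # placeholder API key

resp = requests.post(
    f"{BASE_URL}/v1/embeddings/image",
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {API_KEY}",
    },
    json={"model": "siglip-1", "image": {"base64": "not-valid-base64!!"}},
)
if resp.ok:
    embedding = resp.json()["embedding"]
else:
    # Error body shape from the proposal: {"error": {"message": ..., "type": ...}}
    err = resp.json().get("error", {})
    print(f"{resp.status_code} {err.get('type')}: {err.get('message')}")
```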
💡 Why is this needed? What if we don't build it?
Open to feedback here
Other thoughts
No response