
[RFC] Support multimodal retrieval on top of llama stack, inference provider side

benjibc opened this issue on Dec 20, 2024 · 0 comments

🚀 Describe the new functionality needed

We want to look into multimodal retrieval support for llama stack; as a first step, this RFC discusses what the inference provider API side should look like. In this proposal:

  • Two endpoints: /v1/embeddings/text and /v1/embeddings/image.
  • Both accept a model parameter. Using a shared multimodal model (like "siglip-1") ensures the text and image embeddings live in the same vector space.
  • The text endpoint accepts plain text input; the image endpoint accepts only base64-encoded images.
  • The response format is consistent between text and image embeddings, simplifying integration.
  • This approach sets a foundation for multimodal retrieval and other advanced use cases involving both text and image data.

Endpoints

1. Text Embeddings Endpoint

URL: POST /v1/embeddings/text

Headers:

  • Content-Type: application/json
  • Authorization: Bearer <API_KEY>

Request Body:

  • model (string, required): The name of the model to use. For multimodal capability, this should be set to something like "siglip-1".
  • input (string, required): The text string to embed.
  • options (object, optional):
    • normalize (boolean, default: true): Whether to normalize the embedding vector.
    • return_dims (boolean, default: false): Whether to return the dimensionality of the embedding.

Example Request:

{
  "model": "siglip-1",
  "input": "A photo of a white cat sitting on a chair.",
  "options": {
    "normalize": true,
    "return_dims": false
  }
}

Example Response:

{
  "model": "siglip-1",
  "embedding": [0.0123, -0.0456, 0.0789, ...],
  "usage": {
    "embedding_compute_time_ms": 20
  }
}

If return_dims = true:

{
  "model": "siglip-1",
  "embedding": [0.0123, -0.0456, 0.0789, ...],
  "embedding_dimensions": 1024,
  "usage": {
    "embedding_compute_time_ms": 20
  }
}
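
To make the request shape concrete, here is a minimal Python client sketch for this endpoint. The base URL, API key, and the requests dependency are illustration-only assumptions, not part of the proposal:

import requests

BASE_URL = "http://localhost:8321"  # hypothetical host, for illustration only
API_KEY = "YOUR_API_KEY"            # placeholder credential

def embed_text(text: str, model: str = "siglip-1") -> list[float]:
    """Call the proposed POST /v1/embeddings/text endpoint."""
    resp = requests.post(
        f"{BASE_URL}/v1/embeddings/text",
        # json= sets Content-Type: application/json automatically
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "model": model,
            "input": text,
            "options": {"normalize": True, "return_dims": False},
        },
    )
    resp.raise_for_status()
    return resp.json()["embedding"]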

2. Image Embeddings Endpoint

URL: POST /v1/embeddings/image

Headers:

  • Content-Type: application/json
  • Authorization: Bearer <API_KEY>

Request Body:

  • model (string, required): The name of the model to use. For image embeddings aligned with the text embeddings above, use "siglip-1".
  • image (object, required):
    • base64 (string, required): A base64-encoded representation of the image (e.g., PNG or JPEG). The client must pre-encode the image before sending.
  • options (object, optional):
    • normalize (boolean, default: true): Whether to normalize the embedding vector.
    • return_dims (boolean, default: false): Whether to return the dimensionality of the embedding.

Example Request:

{
  "model": "siglip-1",
  "image": {
    "base64": "iVBORw0KGgoAAAANSUhEUgAAA... (rest of base64 encoded image)"
  },
  "options": {
    "normalize": true,
    "return_dims": false
  }
}

Example Response:

{
  "model": "siglip-1",
  "embedding": [0.023, -0.081, 0.572, ...],
  "usage": {
    "embedding_compute_time_ms": 45
  }
}

If return_dims = true:

{
  "model": "siglip-1",
  "embedding": [0.023, -0.081, 0.572, ...],
  "embedding_dimensions": 1024,
  "usage": {
    "embedding_compute_time_ms": 45
  }
}
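
The same pattern works for images; the only client-side extra is base64-encoding the file before sending. A sketch under the same illustration-only assumptions (BASE_URL, API_KEY) as the text example:

import base64
import requests

BASE_URL = "http://localhost:8321"  # hypothetical host, for illustration only
API_KEY = "YOUR_API_KEY"            # placeholder credential

def embed_image(image_path: str, model: str = "siglip-1") -> list[float]:
    """Base64-encode an image file and call the proposed
    POST /v1/embeddings/image endpoint."""
    with open(image_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    resp = requests.post(
        f"{BASE_URL}/v1/embeddings/image",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "model": model,
            "image": {"base64": encoded},
            "options": {"normalize": True, "return_dims": False},
        },
    )
    resp.raise_for_status()
    return resp.json()["embedding"]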

Example Workflow for Multimodal Retrieval

Scenario: A user wants to find images similar to a textual concept ("A white cat on a chair").

  1. Get the Text Embedding
    Call the text embedding endpoint with your textual query:

    POST /v1/embeddings/text
    {
      "model": "siglip-1",
      "input": "A photo of a white cat sitting on a chair."
    }
    

    Assume the response:

    {
      "model": "siglip-1",
      "embedding": [0.0123, -0.0456, 0.0789, ...]
    }
    

    Store this embedding in your application as text_embedding.

  2. Get the Image Embedding for a Candidate Image
    Convert your candidate image to base64 (done client-side), then:

    POST /v1/embeddings/image
    {
      "model": "siglip-1",
      "image": {
        "base64": "iVBORw0K..."
      }
    }
    

    Assume the response:

    {
      "model": "siglip-1",
      "embedding": [0.023, -0.081, 0.572, ...]
    }
    

    Store this embedding in your application as image_embedding.

  3. Compare the Embeddings
    Because both embeddings come from the same aligned model ("siglip-1"), you can compute cosine similarity or another metric to see how "close" the text concept is to the image content:

    similarity = cosine_similarity(text_embedding, image_embedding)
    

    If similarity is high, this image is likely a good match for the textual query (see the runnable sketch below).
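
The cosine_similarity helper in the pseudocode above is not part of the API; here is a minimal, dependency-free Python sketch, reusing the hypothetical embed_text and embed_image helpers from the earlier sketches:

import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors. If the server
    already normalized both vectors (normalize: true), this reduces to
    a plain dot product."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

text_embedding = embed_text("A photo of a white cat sitting on a chair.")
image_embedding = embed_image("candidate.png")  # hypothetical local file
similarity = cosine_similarity(text_embedding, image_embedding)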


Error Handling

  • 400 Bad Request: Missing model or input/image field, or invalid base64 encoding.
  • 401 Unauthorized: Invalid or missing API key.
  • 415 Unsupported Media Type: If the Content-Type is not application/json.
  • 500 Internal Server Error: Unexpected server issues.

Example Error Response:

{
  "error": {
    "message": "Invalid base64 image encoding",
    "type": "invalid_request_error"
  }
}
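
A sketch of how a client might surface these errors, assuming the error body shape proposed above (the exception mapping is one possible choice, not prescribed by this RFC):

import requests

def raise_for_embedding_error(resp: requests.Response) -> None:
    """Map the proposed status codes and error body to exceptions."""
    if resp.status_code == 200:
        return
    try:
        detail = resp.json()["error"]["message"]
    except (ValueError, KeyError):
        detail = resp.text
    if resp.status_code == 400:
        raise ValueError(f"Invalid request: {detail}")
    if resp.status_code == 401:
        raise PermissionError(f"Authentication failed: {detail}")
    raise RuntimeError(f"Request failed ({resp.status_code}): {detail}")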

💡 Why is this needed? What if we don't build it?

Open to feedback here

Other thoughts

No response
