
[RFC] Support multimodal retrieval on top of llama stack, inference provider side

benjibc opened this issue on Dec 20, 2024 · 0 comments

🚀 Describe the new functionality needed

We want to look into multimodal retrieval support for llama stack; as a first step, this RFC discusses what the inference provider API side should look like. In this proposal:

  • Two endpoints: /v1/embeddings/text and /v1/embeddings/image.
  • Both accept a model parameter. Using a shared multimodal model (like "siglip-1") ensures the text and image embeddings live in the same vector space.
  • The text endpoint accepts plain text input; the image endpoint accepts only base64-encoded images.
  • The response format is consistent between text and image embeddings, simplifying integration.
  • This approach sets a foundation for multimodal retrieval and other advanced use cases involving both text and image data.

Endpoints

1. Text Embeddings Endpoint

URL: POST /v1/embeddings/text

Headers:

  • Content-Type: application/json
  • Authorization: Bearer <API_KEY>

Request Body:

  • model (string, required): The name of the model to use. For multimodal capability, this should be set to something like "siglip-1".
  • input (string, required): The text string to embed.
  • options (object, optional):
    • normalize (boolean, default: true): Whether to normalize the embedding vector.
    • return_dims (boolean, default: false): Whether to return the dimensionality of the embedding.

Example Request:

{
  "model": "siglip-1",
  "input": "A photo of a white cat sitting on a chair.",
  "options": {
    "normalize": true,
    "return_dims": false
  }
}

Example Response:

{
  "model": "siglip-1",
  "embedding": [0.0123, -0.0456, 0.0789, ...],
  "usage": {
    "embedding_compute_time_ms": 20
  }
}

If return_dims = true:

{
  "model": "siglip-1",
  "embedding": [0.0123, -0.0456, 0.0789, ...],
  "embedding_dimensions": 1024,
  "usage": {
    "embedding_compute_time_ms": 20
  }
}
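
To make the request shape concrete, here is a minimal Python client sketch for this endpoint. The base URL, API key, and the requests dependency are illustration-only assumptions, not part of the proposal:

import requests

BASE_URL = "http://localhost:8321"  # hypothetical host, for illustration only
API_KEY = "YOUR_API_KEY"            # placeholder credential

def embed_text(text: str, model: str = "siglip-1") -> list[float]:
    """Call the proposed POST /v1/embeddings/text endpoint."""
    resp = requests.post(
        f"{BASE_URL}/v1/embeddings/text",
        # json= sets Content-Type: application/json automatically
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "model": model,
            "input": text,
            "options": {"normalize": True, "return_dims": False},
        },
    )
    resp.raise_for_status()
    return resp.json()["embedding"]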

2. Image Embeddings Endpoint

URL: POST /v1/embeddings/image

Headers:

  • Content-Type: application/json
  • Authorization: Bearer <API_KEY>

Request Body:

  • model (string, required): The name of the model to use. For image embeddings aligned with the text embeddings above, use "siglip-1".
  • image (object, required):
    • base64 (string, required): A base64-encoded representation of the image (e.g., PNG or JPEG). The client must pre-encode the image before sending.
  • options (object, optional):
    • normalize (boolean, default: true): Whether to normalize the embedding vector.
    • return_dims (boolean, default: false): Whether to return the dimensionality of the embedding.

Example Request:

{
  "model": "siglip-1",
  "image": {
    "base64": "iVBORw0KGgoAAAANSUhEUgAAA... (rest of base64 encoded image)"
  },
  "options": {
    "normalize": true,
    "return_dims": false
  }
}

Example Response:

{
  "model": "siglip-1",
  "embedding": [0.023, -0.081, 0.572, ...],
  "usage": {
    "embedding_compute_time_ms": 45
  }
}

If return_dims = true:

{
  "model": "siglip-1",
  "embedding": [0.023, -0.081, 0.572, ...],
  "embedding_dimensions": 1024,
  "usage": {
    "embedding_compute_time_ms": 45
  }
}
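
The same pattern works for images; the only client-side extra is base64-encoding the file before sending. A sketch under the same illustration-only assumptions (BASE_URL, API_KEY) as the text example:

import base64
import requests

BASE_URL = "http://localhost:8321"  # hypothetical host, for illustration only
API_KEY = "YOUR_API_KEY"            # placeholder credential

def embed_image(image_path: str, model: str = "siglip-1") -> list[float]:
    """Base64-encode an image file and call the proposed
    POST /v1/embeddings/image endpoint."""
    with open(image_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    resp = requests.post(
        f"{BASE_URL}/v1/embeddings/image",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "model": model,
            "image": {"base64": encoded},
            "options": {"normalize": True, "return_dims": False},
        },
    )
    resp.raise_for_status()
    return resp.json()["embedding"]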

Example Workflow for Multimodal Retrieval

Scenario: A user wants to find images similar to a textual concept ("A white cat on a chair").

  1. Get the Text Embedding
    Call the text embedding endpoint with your textual query:

    POST /v1/embeddings/text
    {
      "model": "siglip-1",
      "input": "A photo of a white cat sitting on a chair."
    }
    

    Assume the response:

    {
      "model": "siglip-1",
      "embedding": [0.0123, -0.0456, 0.0789, ...]
    }
    

    Store this embedding in your application as text_embedding.

  2. Get the Image Embedding for a Candidate Image
    Convert your candidate image to base64 (done client-side), then:

    POST /v1/embeddings/image
    {
      "model": "siglip-1",
      "image": {
        "base64": "iVBORw0K..."
      }
    }
    

    Assume the response:

    {
      "model": "siglip-1",
      "embedding": [0.023, -0.081, 0.572, ...]
    }
    

    Store this embedding in your application as image_embedding.

  3. Compare the Embeddings
    Because both embeddings come from the same aligned model ("siglip-1"), you can compute cosine similarity or another metric to see how "close" the text concept is to the image content:

    similarity = cosine_similarity(text_embedding, image_embedding)
    

    If similarity is high, this image is likely a good match for the textual query (see the runnable sketch below).
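
The cosine_similarity helper in the pseudocode above is not part of the API; here is a minimal, dependency-free Python sketch, reusing the hypothetical embed_text and embed_image helpers from the earlier sketches:

import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors. If the server
    already normalized both vectors (normalize: true), this reduces to
    a plain dot product."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

text_embedding = embed_text("A photo of a white cat sitting on a chair.")
image_embedding = embed_image("candidate.png")  # hypothetical local file
similarity = cosine_similarity(text_embedding, image_embedding)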


Error Handling

  • 400 Bad Request: Missing model or input/image field, or invalid base64 encoding.
  • 401 Unauthorized: Invalid or missing API key.
  • 415 Unsupported Media Type: If the Content-Type is not application/json.
  • 500 Internal Server Error: Unexpected server issues.

Example Error Response:

{
  "error": {
    "message": "Invalid base64 image encoding",
    "type": "invalid_request_error"
  }
}
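
A sketch of how a client might surface these errors, assuming the error body shape proposed above (the exception mapping is one possible choice, not prescribed by this RFC):

import requests

def raise_for_embedding_error(resp: requests.Response) -> None:
    """Map the proposed status codes and error body to exceptions."""
    if resp.status_code == 200:
        return
    try:
        detail = resp.json()["error"]["message"]
    except (ValueError, KeyError):
        detail = resp.text
    if resp.status_code == 400:
        raise ValueError(f"Invalid request: {detail}")
    if resp.status_code == 401:
        raise PermissionError(f"Authentication failed: {detail}")
    raise RuntimeError(f"Request failed ({resp.status_code}): {detail}")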

💡 Why is this needed? What if we don't build it?

Open to feedback here

Other thoughts

No response
