llama_index [Multimodal] General questions

Open chengyjonathan opened this issue 1 year ago • 0 comments

Hi there, love the work here! I have a few questions about multimodal features.

Let's say I want to do some text -> image + text retrieval, along the lines of the "Q & A over LlamaIndex Documentation" example (https://github.com/jerryjliu/llama_index/blob/main/examples/multimodal/Multimodal.ipynb).

But let's say my images contain entities that are not recognized by CLIP, BLIP, etc., so it would be hard to support queries like "show me images of X-entity".

Do you have any recommendations on how to proceed?

Would adding information about the entity, along with other context, to the filename help support these kinds of queries? In the example, we can add metadata using the filename, but it isn't immediately clear to me whether that metadata is encoded along with the image.
Is it possible to use our own fine-tuned versions of BLIP or CLIP to embed our documents/queries, i.e. along the lines of https://gpt-index.readthedocs.io/en/latest/how_to/embeddings.html?
Other ideas?
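
To make the filename idea concrete, here is a rough sketch of what I have in mind. It is plain Python and does not use LlamaIndex's API; the function names (`filename_to_metadata`, `keyword_score`, `retrieve`) and the example filenames are all hypothetical. It assumes the entity name is written into each filename and simply matches query words against those filename tokens, ignoring the image embedding entirely.

```python
import re
from pathlib import Path


def _tokenize(text: str) -> list[str]:
    """Lowercase alphanumeric tokens; anything else is treated as a separator."""
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", text) if t]


def filename_to_metadata(path: str) -> dict:
    """Hypothetical helper: derive keyword metadata from the filename stem."""
    p = Path(path)
    return {"file_name": p.name, "keywords": _tokenize(p.stem)}


def keyword_score(query: str, metadata: dict) -> float:
    """Fraction of query tokens that also appear in the filename keywords."""
    q_tokens = _tokenize(query)
    if not q_tokens:
        return 0.0
    keywords = set(metadata["keywords"])
    return sum(t in keywords for t in q_tokens) / len(q_tokens)


def retrieve(query: str, image_paths: list[str], top_k: int = 3) -> list[str]:
    """Return up to top_k image paths whose filename shares a token with the query."""
    scored = [(keyword_score(query, filename_to_metadata(p)), p) for p in image_paths]
    scored = [(s, p) for s, p in scored if s > 0]
    # Stable sort: paths with equal scores keep their input order.
    scored.sort(key=lambda sp: sp[0], reverse=True)
    return [p for _, p in scored[:top_k]]


if __name__ == "__main__":
    paths = [
        "imgs/acme-widget_front.jpg",
        "imgs/cat_on_sofa.png",
        "imgs/acme-widget_side.jpg",
    ]
    print(retrieve("acme widget", paths))
```

In practice I would presumably want to combine a score like this with the image embedding similarity rather than replace it, which is why I am asking whether the metadata gets encoded with the image.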

Thank you!

Mar 20 '23 17:03 chengyjonathan
