
Embed media like images, audio, 3D, video, etc.?

Open fire opened this issue 1 year ago • 6 comments

Hi,

I was wondering if it was in scope to embed media?

fire avatar Feb 26 '24 16:02 fire

That's definitely in scope. The best way to approach this would be to introduce the necessary embedding providers and to modify or create a new pipeline that shows an example of this in action.

I'm happy to team up on this.
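
To make the "embedding provider plus pipeline" idea concrete, here is a minimal sketch. Everything in it is hypothetical: `EmbeddingProvider`, `ToyByteHistogramProvider`, and `ingest` are illustrative names, not R2R's actual classes, and the byte histogram is only a stand-in for a real image model.

```python
from typing import Dict, List, Protocol


class EmbeddingProvider(Protocol):
    """Hypothetical interface: anything that turns raw bytes into a vector."""

    def embed(self, data: bytes) -> List[float]: ...


class ToyByteHistogramProvider:
    """Stand-in for a real image model: a normalized histogram of byte values."""

    def __init__(self, buckets: int = 4) -> None:
        if not 1 <= buckets <= 256:
            raise ValueError("buckets must be between 1 and 256")
        self.buckets = buckets

    def embed(self, data: bytes) -> List[float]:
        counts = [0] * self.buckets
        for byte in data:
            # Map a byte value in 0..255 onto one of the buckets.
            counts[byte * self.buckets // 256] += 1
        total = len(data)
        if total == 0:
            return [0.0] * self.buckets
        return [count / total for count in counts]


def ingest(documents: Dict[str, bytes], provider: EmbeddingProvider) -> Dict[str, List[float]]:
    """Hypothetical pipeline step: embed every blob with the given provider."""
    return {doc_id: provider.embed(blob) for doc_id, blob in documents.items()}


vectors = ingest({"img-1": bytes([0, 10, 200, 255])}, ToyByteHistogramProvider(buckets=4))
```

The point of the sketch is that the pipeline depends only on the `embed` method, so an image, audio, or mesh provider could be swapped in without touching the ingestion step.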

emrgnt-cmplxty avatar Feb 26 '24 19:02 emrgnt-cmplxty

I have two primary use cases, plus a possible third:

  1. The basic use case is taking an image and turning it into an embedding for use, like Stable Diffusion or the various combined vision-text models. There are a few models that can also do video.
  2. My pet emerging-technology use case is to take a 3D mesh from https://github.com/lucidrains/meshgpt-pytorch and have it autocomplete vertices, or search a database of other embedded meshes using the mesh token embedding.
  3. Someday, maybe: audio and speech. I am not at all familiar with this area.
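
The "search a database of other embedded meshes" part of item 2 could look roughly like the sketch below. It assumes each mesh (or image) has already been reduced to a fixed-length float vector; the vectors and names here are made-up toy data, and nothing in it is meshgpt-pytorch or R2R API.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors of the same dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Treat an all-zero vector as similar to nothing.
    return dot / (norm_a * norm_b)


def search(query: Vector, database: Dict[str, Vector], top_k: int = 3) -> List[Tuple[str, float]]:
    """Return the top_k database entries most similar to the query."""
    scored = [(key, cosine_similarity(query, vec)) for key, vec in database.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy embeddings; a real system would get these from an embedding model.
toy_db = {
    "chair": [1.0, 0.0, 0.0],
    "stool": [0.9, 0.1, 0.0],
    "teapot": [0.0, 0.0, 1.0],
}
results = search([1.0, 0.05, 0.0], toy_db, top_k=2)
```

A linear scan like this is fine for a handful of items; a vector database would replace `search` at scale.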

fire avatar Feb 26 '24 19:02 fire

For image embedding, do you think we can fit it into the pipeline here [https://github.com/SciPhi-AI/R2R/blob/main/r2r/pipelines/basic/ingestion.py] with a specific embedding provider, or do you think we need to fundamentally rework the structure of the codebase in some way?

I think multi-modal is an important use case and I am very interested in figuring out how to best support this.

emrgnt-cmplxty avatar Feb 27 '24 20:02 emrgnt-cmplxty

I don't think I can drive multi-modal too much, but I'll see what spare time I can gather.

fire avatar Feb 29 '24 01:02 fire

The obvious question is what happens when we have two different embedding models, such as ones based on token integers: how do we sync them?
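
One possible direction, sketched here as an assumption rather than anything decided in this thread: vectors from two unrelated models generally do not live in the same space, so instead of syncing them, tag every vector with the model that produced it and only compare within one model. `Embedding` and `MultiModelStore` are hypothetical names.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import DefaultDict, List, Tuple


@dataclass(frozen=True)
class Embedding:
    model: str  # Name of the model that produced the vector.
    vector: Tuple[float, ...]


class MultiModelStore:
    """Keeps one collection per model so vectors are never mixed across models."""

    def __init__(self) -> None:
        self._by_model: DefaultDict[str, List[Tuple[str, Embedding]]] = defaultdict(list)

    def add(self, item_id: str, embedding: Embedding) -> None:
        existing = self._by_model[embedding.model]
        # Vectors from the same model must share a dimension.
        if existing and len(existing[0][1].vector) != len(embedding.vector):
            raise ValueError("dimension mismatch within model " + embedding.model)
        existing.append((item_id, embedding))

    def candidates(self, query: Embedding) -> List[str]:
        """Ids of items that are comparable to the query (same model only)."""
        return [item_id for item_id, _ in self._by_model.get(query.model, [])]


store = MultiModelStore()
store.add("img-1", Embedding("image-model", (0.1, 0.9)))
store.add("mesh-1", Embedding("mesh-model", (0.3, 0.3, 0.4)))
comparable = store.candidates(Embedding("image-model", (0.0, 1.0)))
```

Cross-modal search would still need a model trained to place both modalities in a shared space; this sketch only prevents accidental comparisons between incompatible vectors.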

fire avatar Feb 29 '24 01:02 fire