ImageBind
[Help] How can I generate images or audio?
Hey, could someone explain to me (I have no AI/ML background) how this model could be used to generate images or audio? I can generate 3 x 3 tensors in code, no problem, but what's the next step to leverage these tensors?
I'm pretty sure I'm not the only one who will stand here and think to himself: "what now?" I would appreciate a hint or anything that would explain how I could use these tensors without having to read the paper (which I tried but didn't really grasp).
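A rough sketch of what those tensors mean, in case it helps. As far as I understand, ImageBind on its own is an embedding model rather than a generator: it maps images, text, and audio into one shared vector space. I'm assuming the 3 x 3 tensors mentioned above are the softmax similarity matrices from the README example (3 inputs of one modality scored against 3 of another). The snippet below uses NumPy with small hand-written vectors standing in for real embeddings; the labels, vectors, and the `similarity_matrix` helper are made up for illustration and are not part of the ImageBind API.

```python
import numpy as np


def normalize(x):
    """Scale each row to unit length so dot products act like cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def similarity_matrix(a, b):
    """Entry (i, j) says how strongly item i of `a` matches item j of `b`."""
    return softmax(a @ b.T, axis=-1)


# Hypothetical stand-ins for real embeddings (real ones are much longer vectors).
labels = ["a dog", "a car", "a bird"]
text_emb = normalize(np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]))
image_emb = normalize(np.array([
    [0.9, 0.1, 0.0, 0.2],  # pretend photo of a dog
    [0.1, 0.8, 0.1, 0.1],  # pretend photo of a car
    [0.0, 0.2, 0.9, 0.1],  # pretend photo of a bird
]))

# 3 x 3 matrix: one row per image, one column per text label.
scores = similarity_matrix(image_emb, text_emb)

# The "next step": pick the best-matching label for every image.
matches = [labels[j] for j in scores.argmax(axis=-1)]
```

So the tensors are a lookup tool: given one input, you rank candidates from another modality by score. Actually generating images or audio would need a separate generative model that accepts such embeddings as conditioning.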
Same here, I just need some examples.
Yeah, I need them too :)
Same. I am also interested in an example for the embedding space arithmetic showcased in Figure 4 of the paper where they retrieve an image using an image and audio.
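For the embedding arithmetic, here is a minimal sketch of how I read the idea: add the image embedding and the audio embedding, then look up the nearest image in a gallery by cosine similarity. This is an assumption about the general technique, not the authors' exact procedure for Figure 4. All vectors and gallery names below are hand-written placeholders; in practice each one would come from the model's encoder for that modality.

```python
import numpy as np


def normalize(x):
    """Scale vectors to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def retrieve(query, gallery):
    """Return the index of the gallery row with the highest cosine similarity."""
    sims = normalize(gallery) @ normalize(query)
    return int(sims.argmax())


# Hypothetical gallery of image embeddings (toy 3-dimensional vectors).
gallery_names = ["fruit on a table", "birds on a wire", "birds in a fruit tree"]
gallery = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
])

# Hypothetical query embeddings from two different modalities.
image_query = normalize(np.array([1.0, 0.0, 0.1]))  # a photo of fruit
audio_query = normalize(np.array([0.0, 1.0, 0.1]))  # a recording of birdsong

# Embedding arithmetic: sum the two unit vectors to combine their meaning.
combined = image_query + audio_query

best = retrieve(combined, gallery)
result = gallery_names[best]
```

With these toy numbers, the combined query lands closest to the gallery entry that contains both concepts, while either query on its own would rank a single-concept image first.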
You may find ViT-Lens of interest; it works with an MLLM to generate text or images from other modalities :)