
Query regarding model capabilities

Open vishaal27 opened this issue 2 years ago • 1 comment

Hey!

I just finished reading your paper -- amazing work and the results look awesome!

I had one query regarding your model's capabilities. As I understand it, at inference time you retrieve the most similar images from the cached index and attend over their image features as keys and values.

I wanted to know whether this model could be repurposed for prompt-specific image captioning of a given image. For example, given an image of an elephant standing near a lake next to a tree, could the model be prompted with something like "Describe the background of the image" or "In the distance, we can see" to output a caption that describes only the background of the image (the lake and the tree) rather than the foreground (the elephant)? Since your model is trained auto-regressively, this seems feasible to me. Please let me know your thoughts!
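For other readers: a minimal sketch of the inference-time mechanism described above (single-head scaled dot-product cross-attention, where text-token hidden states act as queries and retrieved image features act as both keys and values). All shapes and names here are hypothetical for illustration; the actual VaLM fusion layer may use learned projections, multiple heads, and gating that this sketch omits.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def visual_cross_attention(text_hidden, image_feats):
    """Attend from text-token states (queries) over retrieved
    image features (used as both keys and values).

    text_hidden: (seq_len, d) hidden states of the text tokens
    image_feats: (k, d) features of the k retrieved images
    returns:     (seq_len, d) visually conditioned states
    """
    d = text_hidden.shape[-1]
    scores = text_hidden @ image_feats.T / np.sqrt(d)  # (seq_len, k)
    weights = softmax(scores, axis=-1)                 # rows sum to 1
    return weights @ image_feats                       # (seq_len, d)

# toy example: 4 text tokens, 3 retrieved images, feature dim 8
rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))
imgs = rng.normal(size=(3, 8))
out = visual_cross_attention(text, imgs)
print(out.shape)  # (4, 8)
```

Under this view, prompt-specific captioning would amount to fixing the retrieved image (rather than retrieving from the index) and letting the autoregressive decoder condition each generated token on it through this attention step.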

vishaal27 avatar May 23 '22 13:05 vishaal27

Thanks for the great comments and ideas! We are currently working on adapting VaLM to vision-language tasks, especially image captioning and VQA. We will add more experimental results in a later version of VaLM. Thanks again for brainstorming with us!

Victorwz avatar May 24 '22 00:05 Victorwz