Roger Wang
@Isotr0py It looks like the generation of image embedding from pixel values and merging with text embedding is currently tied together under `Phi3HDImageEmbedding`. Could you take a look to decouple...
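A minimal sketch of the decoupling requested above, using plain Python lists instead of tensors. The function names (`embed_images`, `merge_embeddings`), the placeholder token id, and the toy "encoder" are all hypothetical and are not taken from `Phi3HDImageEmbedding`; the point is only that once the two steps are separate, precomputed image embeddings can be passed straight to the merge step.

```python
from typing import List

Vector = List[float]


def embed_images(pixel_values: List[List[float]]) -> List[Vector]:
    """Hypothetical stand-in for a vision encoder: one 2-d vector per image."""
    return [[sum(img) / len(img), float(len(img))] for img in pixel_values]


def merge_embeddings(
    input_ids: List[int],
    text_embeds: List[Vector],
    image_embeds: List[Vector],
    placeholder_id: int,
) -> List[Vector]:
    """Replace the embedding at each placeholder token with an image embedding."""
    num_placeholders = sum(1 for tok in input_ids if tok == placeholder_id)
    if num_placeholders != len(image_embeds):
        raise ValueError(
            f"expected {num_placeholders} image embeddings, got {len(image_embeds)}"
        )
    remaining = iter(image_embeds)
    return [
        next(remaining) if tok == placeholder_id else emb
        for tok, emb in zip(input_ids, text_embeds)
    ]


# Assumed placeholder id 99; the text embedding at that position is discarded.
input_ids = [1, 99, 2]
text_embeds = [[0.1, 0.1], [0.0, 0.0], [0.2, 0.2]]

# Path 1: compute image embeddings from pixel values, then merge.
merged_from_pixels = merge_embeddings(
    input_ids, text_embeds, embed_images([[1.0, 3.0]]), placeholder_id=99
)

# Path 2: skip the encoder and merge precomputed embeddings directly.
merged_from_embeds = merge_embeddings(
    input_ids, text_embeds, [[5.0, 6.0]], placeholder_id=99
)
```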
> @ywang96 Ok, I will decouple them tonight. (Sorry that I don't have bandwidth at daytime)

No rush at all, and thank you for helping out!
@DarkLight1337 Please give this PR a first pass. I have updated all vision language models except two:
- `Chameleon` (since the model itself is only input embedding based).
- ...
> The only small change I would make is to add an `assert_never` guard at the end of each `_parse_and_validate_image_input` function to make sure that we have handled all of...
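For illustration, an `assert_never` guard of the kind suggested above might look like the sketch below. The two input classes and the `describe_image_input` function are hypothetical stand-ins, not the PR's actual types; only `assert_never` itself is a real API (from `typing` on Python 3.11+, or `typing_extensions` on older versions).

```python
from dataclasses import dataclass
from typing import List, Union

try:
    from typing import assert_never  # Python 3.11+
except ImportError:
    from typing_extensions import assert_never


@dataclass
class ImagePixelInputs:  # hypothetical: raw pixel values
    data: List[List[float]]


@dataclass
class ImageEmbeddingInputs:  # hypothetical: precomputed embeddings
    data: List[List[float]]


ImageInputs = Union[ImagePixelInputs, ImageEmbeddingInputs]


def describe_image_input(image_input: ImageInputs) -> str:
    """Dispatch on the input variant; the final guard catches unhandled ones."""
    if isinstance(image_input, ImagePixelInputs):
        return "pixel_values"
    if isinstance(image_input, ImageEmbeddingInputs):
        return "image_embeds"
    # A static type checker reports an error here if a new variant is added
    # to ImageInputs without a matching branch above. At runtime this raises
    # AssertionError for any value that slips through.
    assert_never(image_input)
```

The guard costs nothing on the handled paths. Its value is that adding a third variant to the union without updating the function becomes a type-check failure rather than a silent fall-through.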
On a side note, I realized that supporting image embeddings as input is also not feasible for `Fuyu`, because its image processor applies additional logic together with the tokenizer. Maybe @Isotr0py has...
@DarkLight1337 This PR is ready for final review. I have added a test with Llava 1.5 and updated the documentation.
Hey @Andcircle! Thanks for reaching out! Yes, as you mentioned, this PR allows image embeddings as input, so that users can process images into embeddings separately...
> @ywang96 Thanks for your fast response! Yes, I think #6869 should free us. > > Just to be clarified, #6869 's use case can be much broader =) not...
> > > @ywang96 Thanks for your fast response! Yes, I think #6869 should free us. > > > Just to be clarified, #6869 's use case can be much...
@Isotr0py Hey, do you think it makes sense to support image embeddings for Fuyu? (Currently we cannot easily do it, since the embedding creation is tied to the tokenizer.) We don't...