
About the [IMG] tokens and training data

Open · ALR-alr opened this issue on Aug 25 '24 · 0 comments

I have some questions about the paper.

1. As mentioned in this issue (https://github.com/kohjingyu/gill/issues/5#issuecomment-1619006482), "So the model will never produce [IMG2]...[IMG8] organically, but their representations are still helpful for feeding into the GILLMapper module for image generation." But if the model never produces [IMG2]...[IMG8], how can the hidden states of these tokens be used for the image generation and retrieval tasks? If the representations are taken from the embedding matrix instead, then no matter which image is input, the same features would be used for generation and retrieval, wouldn't they?
2. Are the loss objectives l_c and l_p trained in two stages? Training l_p requires [IMG1]...[IMGr] as input, whereas training l_c takes interleaved image and text as input.
3. If an interleaved image-text dataset is not needed, how does the model know when to generate the [IMG0] token?
4. How are [IMG2]...[IMG8] forced to be produced after [IMG0]? (See the sketch after this list for what I imagine happens.)

Thanks for your attention.
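To make question 4 concrete, here is a minimal sketch of what I imagine happens at inference: once [IMG0] is sampled, the remaining [IMG] tokens are appended deterministically and their hidden states are passed to the mapper. It is written against a Hugging Face-style causal LM interface; `collect_img_hidden_states`, `gill_mapper`, and `img_token_ids` are my own placeholder names, not the repo's actual API. Is this roughly correct?

```python
import torch


@torch.no_grad()
def collect_img_hidden_states(lm, input_ids, img_token_ids, gill_mapper):
    """Append the remaining [IMG] tokens after a generated [IMG0] and
    collect their hidden states for the mapper (sketch, not the repo code).

    lm            : causal LM returning hidden states (HF-style call with
                    output_hidden_states=True is assumed here)
    input_ids     : (1, T) prompt token ids, ending with the [IMG0] that
                    the LM just generated
    img_token_ids : vocabulary ids of [IMG0]...[IMGr], in order
    gill_mapper   : stand-in for the module that maps the (1, r, d) hidden
                    states to image-generation conditioning
    """
    # The trailing [IMG] tokens are never sampled; they are appended
    # deterministically once [IMG0] has been produced.
    forced = torch.tensor([img_token_ids[1:]],
                          dtype=input_ids.dtype, device=input_ids.device)
    full_ids = torch.cat([input_ids, forced], dim=1)

    out = lm(input_ids=full_ids, output_hidden_states=True)
    last_layer = out.hidden_states[-1]        # (1, T + r - 1, d)

    # Hidden states at all r [IMG] positions. Because of causal attention,
    # they depend on the preceding context, so they are not the same fixed
    # embeddings for every input.
    r = len(img_token_ids)
    img_hidden = last_layer[:, -r:, :]        # (1, r, d)

    return gill_mapper(img_hidden)
```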

ALR-alr · Aug 25 '24 12:08