gill
about [img] token and train data
I have some questions about the paper.

1. As mentioned in this issue (https://github.com/kohjingyu/gill/issues/5#issuecomment-1619006482), you say: "So the model will never produce [IMG2]...[IMG8] organically, but their representations are still helpful for feeding into the GILLMapper module for image generation." But if the model never produces [IMG2]...[IMG8], how can the hidden states of these tokens be used to complete the image generation and retrieval tasks? If their representations come straight from the embedding matrix, doesn't that mean the same features are used for generation and retrieval no matter which image we input?
2. Are the loss objectives l_c and l_p trained in two stages? Training l_p requires [IMG1]...[IMGr] as input, whereas training l_c takes interleaved image and text as input.
3. If an interleaved image-text dataset is not needed, how does the model learn when to generate the [IMG0] token?
4. How are [IMG2]...[IMG8] forced to be produced after [IMG0]?

Thanks for your attention.
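Regarding the last question, my current guess is that this is handled at decoding time rather than by the model itself: once the first [IMG] token is sampled, the remaining [IMG] token ids are appended deterministically instead of being sampled. A minimal sketch of that idea (the token ids, the `decode_with_forced_img` helper, and the sampler are all made up for illustration, not GILL's actual code):

```python
# Hypothetical sketch: the LM only ever *samples* the first [IMG] token;
# the rest of the [IMG] sequence is force-appended by the decoding loop.
# Token ids below are placeholders, not GILL's real vocabulary ids.

FIRST_IMG = 32000                        # assumed id of the first [IMG] token
EXTRA_IMG = list(range(32001, 32008))    # assumed ids of the remaining [IMG] tokens

def decode_with_forced_img(sample_next, prompt_ids, max_new=20):
    """Greedy-style loop: sample normally until the first [IMG] token
    appears, then append the remaining [IMG] ids without sampling."""
    out = list(prompt_ids)
    for _ in range(max_new):
        tok = sample_next(out)           # normal next-token sampling
        out.append(tok)
        if tok == FIRST_IMG:
            out.extend(EXTRA_IMG)        # forced, never sampled
            break
    return out

# Toy sampler that emits two text tokens, then the first [IMG] token.
script = iter([11, 12, FIRST_IMG])
result = decode_with_forced_img(lambda ctx: next(script), [1, 2])
print(result)
```

If this is roughly what happens, it would also answer question 1: a forward pass over the forced sequence still yields input-dependent hidden states for those positions, which can then be fed into GILLMapper.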