LLaVA
[Question] Why is 'mm_vision_select_layer' == -2 in the config?
Question
In the training scripts, 'mm_vision_select_layer' is set to -2, which means the output of the CLIP vision encoder's penultimate layer is used as the image features. Why not use the last layer's output instead?
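To make the indexing concrete, here is a minimal sketch of what selecting layer -2 means. It uses a toy encoder in plain Python rather than the real CLIP model, and it assumes the encoder returns a list of hidden states with the embeddings first and then one entry per layer. The names `run_encoder`, `select_features`, and `drop_cls` are hypothetical and are not taken from the LLaVA code.

```python
from typing import Callable, List

Tokens = List[float]


def run_encoder(embeddings: Tokens, layers: List[Callable[[Tokens], Tokens]]) -> List[Tokens]:
    """Return every intermediate state: the embeddings first, then one entry per layer."""
    hidden_states = [embeddings]
    for layer in layers:
        hidden_states.append(layer(hidden_states[-1]))
    return hidden_states


def select_features(hidden_states: List[Tokens], select_layer: int = -2, drop_cls: bool = True) -> Tokens:
    """Pick one layer's output by index; optionally drop the first (CLS-like) token."""
    features = hidden_states[select_layer]
    return features[1:] if drop_cls else features


# Toy encoder (assumption): 3 "layers", each adding 1.0 to every token value.
toy_layers = [lambda tokens: [t + 1.0 for t in tokens] for _ in range(3)]

# Toy input: one CLS-like token followed by two patch tokens.
states = run_encoder([0.0, 10.0, 20.0], toy_layers)

penultimate = select_features(states, select_layer=-2)  # output of layer 2 of 3
last = select_features(states, select_layer=-1)         # output of layer 3 of 3

print(penultimate)  # [12.0, 22.0]
print(last)         # [13.0, 23.0]
```

With this layout, -1 indexes the final layer's output and -2 the one before it, so the setting only changes which entry of the hidden-state list is passed on as image features.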