Video-ChatGPT
Cannot understand choice of mm_hidden_size 1024
I'm trying to understand how the spatial and temporal features fit into the projection layer. According to the config file on Hugging Face, mm_hidden_size is set to 1024.
Hugging Face link: https://huggingface.co/mmaaz60/LLaVA-7B-Lightening-v1-1/blob/main/config.json
From what I understand, 100 frames are sampled per video and the CLIP encoder outputs a 1024-dimensional vector for each patch. A temporal mean (averaging over frames) gives features of shape (number of patches, 1024), and a spatial mean (averaging over the patches of each frame) gives features of shape (100, 1024), where 100 is the number of frames. Does this mean the input shape of the projection layer is (number of patches + 100, 1024)?
I don't understand how a projection layer with an input size of 1024 accepts an input of this shape.
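
To make the shapes in question concrete, here is a minimal pure-Python sketch of the pooling and concatenation as I described them, not the repository's actual code. It uses toy sizes so it runs instantly. T stands in for the 100 frames and D for mm_hidden_size (1024). N (patches per frame), OUT (the projection's output size), the concatenation order, and the weight values are all hypothetical placeholders. The one thing I'm relying on is that a linear layer acts on the last dimension only, so each of the (N + T) rows is projected independently.

```python
# Toy sizes; in the post T would be 100 and D would be 1024.
# N and OUT are hypothetical placeholders, not values from the config.
T, N, D, OUT = 4, 3, 8, 5

# Per-frame patch features from the image encoder: shape (T, N, D).
features = [[[float(t + n + d) for d in range(D)]
             for n in range(N)]
            for t in range(T)]


def mean_rows(rows):
    """Average a list of equal-length vectors element-wise."""
    return [sum(col) / len(rows) for col in zip(*rows)]


# Temporal mean: average over frames for each patch -> shape (N, D).
temporal = [mean_rows([features[t][n] for t in range(T)]) for n in range(N)]

# Spatial mean: average over patches for each frame -> shape (T, D).
spatial = [mean_rows(features[t]) for t in range(T)]

# Concatenate along the token axis -> shape (N + T, D).
# The order here is an assumption made for illustration.
tokens = temporal + spatial

# A linear projection with weight shape (OUT, D), applied row by row.
# The weight values are arbitrary placeholders.
weight = [[0.01 * (o + 1) for _ in range(D)] for o in range(OUT)]


def project(row):
    """Map one D-dimensional token to an OUT-dimensional token."""
    return [sum(w * x for w, x in zip(w_row, row)) for w_row in weight]


projected = [project(row) for row in tokens]

print(len(tokens), len(tokens[0]))        # (N + T, D)
print(len(projected), len(projected[0]))  # (N + T, OUT)
```

If this reading is right, the number of rows (N + T) behaves like a sequence length and never has to match mm_hidden_size; only the last dimension does.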