Qformer mm_projector issue
Hello, thanks for your amazing work! I ran into a problem when running this code; could you help me solve it?
When training with the script DS_MLLMSD11_train.py, I encountered this error:
File "/SmartEdit/model/DS_MLLMSD11_model.py", line 243, in load_pretrain_MLLM_alignment
mm_projector_param = {'weight': weights.pop('mm_projector.weight'), 'bias': weights.pop('mm_projector.bias')}
KeyError: 'mm_projector.weight'
The path of SD_QFormer_conversation_33tokens is: /SmartEdit/checkpoints/stage1_CC12M_alignment_7b/embeddings_qformer/checkpoint-150000.bin
In addition, I can run the stage 1 inference code successfully.
The QFormer model trained in the first stage is a 6-block BERT-based model. I printed the keys in the checkpoint's weight dict, and it seems the checkpoint doesn't contain any "mm_projector" entries.
There is another point I'm confused about: I thought the "mm_projector" module only exists in the LLaVA model, where its role is to project the image embeddings (from the ViT) from the image latent space into the text latent space. I have no idea why the QFormer module would need an mm_projector; these seem to be two completely different things.
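For reference, here is a minimal sketch of how one can list the keys in the stage-1 checkpoint to confirm whether the mm_projector entries are present (using the checkpoint path from my setup above):

import torch

# Load the stage-1 QFormer checkpoint and list its keys to check for mm_projector.
ckpt_path = "/SmartEdit/checkpoints/stage1_CC12M_alignment_7b/embeddings_qformer/checkpoint-150000.bin"
weights = torch.load(ckpt_path, map_location="cpu")

for name, tensor in weights.items():
    print(name, tuple(tensor.shape) if hasattr(tensor, "shape") else type(tensor))

print("has mm_projector.weight:", "mm_projector.weight" in weights)
print("has mm_projector.bias:", "mm_projector.bias" in weights)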
Thanks for your interest in our work. You might be right; this may be a small error introduced when we pushed the code to GitHub, and it works well after that modification.
Thanks very much for your reply!! Looking forward to the updated version!
There is another question about this function: I can't align the LLaMA output with the QFormer input in the code. weights.pop('lm_head.weight') is a [33, 4096] tensor, while self.config.num_new_tokens is 35 (32 + 2 + 1 in the previous setting):
# 1. vec2word: Linear(in_features=4096, out_features=32035, bias=False)
LLaMA_lm_haed = weights.pop('lm_head.weight')
LLaMA_lm_haed = LLaMA_lm_haed[-self.config.num_new_tokens:]
self.lm_head.weight.data[-self.config.num_new_tokens:] = LLaMA_lm_haed
Is there anything wrong with my configuration? I set up the configuration file following the original code.
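To make the mismatch concrete, here is a small standalone sketch with stand-in tensors of the same shapes (my own toy example, not the repo code), showing why the slice assignment fails when num_new_tokens is 35 but the checkpoint only carries 33 new rows:

import torch

num_new_tokens = 35                        # 32 + 2 + 1 from my config
ckpt_lm_head = torch.randn(33, 4096)       # same shape as weights['lm_head.weight'] in the checkpoint
lm_head_weight = torch.zeros(32035, 4096)  # vec2word: Linear(4096 -> 32035)

sliced = ckpt_lm_head[-num_new_tokens:]    # slicing past the start just clamps: still [33, 4096]
print(sliced.shape)

try:
    # mirrors: self.lm_head.weight.data[-self.config.num_new_tokens:] = LLaMA_lm_haed
    lm_head_weight[-num_new_tokens:] = sliced
except RuntimeError as err:
    print("shape mismatch:", err)  # 33 rows cannot fill a 35-row slice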
@zjutkarma Hi, I'm having both issues you mentioned:
- The mm_projector KeyError when loading the QFormer weights
- The token number mismatch between the LLaMA lm_head (33) and config.num_new_tokens (35)
Could you share:
- How did you resolve the mm_projector issue?
- How did you handle the token number mismatch? Did you:
  - Modify num_new_tokens in the MLLMSD config to match Stage 1 (33)?
  - Or adjust Stage 1 training to use 35 tokens?
  - Or find another solution?
Really appreciate your help! Thanks!
@XuwuChen443 Hi, yes, it's a little tricky because it involves multiple components. Here's my experience.
- I checked my code: I commented out the mm_projector lines and it works. This function doesn't need the mm_projector (see the guarded alternative sketched after the code below).
#print('mm_projector weight:', weights['mm_projector.weight'] == LLaVA_00002_weights['model.mm_projector.weight'])
#print('mm_projector bias:', weights['mm_projector.bias'] == LLaVA_00002_weights['model.mm_projector.bias'])
- For the second question, it's a little hard to remember, but I can show you the code of the function I used. I think it is roughly the latter solution you suggested. Hope it works for you. :)
def load_pretrain_MLLM_alignment(self, SD_QFormer_conversation_33tokens, LLaVA_00002):
    weights = torch.load(SD_QFormer_conversation_33tokens, map_location="cpu")
    print("q_former weight", list(weights.keys()))
    LLaVA_00002_weights = torch.load(LLaVA_00002, map_location="cpu")
    # 1. vec2word: Linear(in_features=4096, out_features=32035, bias=False)
    LLaMA_lm_haed = weights.pop('lm_head.weight')
    print("llama head <-> qformer layer count:", LLaMA_lm_haed.shape)  # torch.Size([33, 4096])
    LLaMA_lm_haed = LLaMA_lm_haed[-self.config.num_new_tokens:]
    print("num_new_tokens:", self.config.num_new_tokens)
    print(LLaMA_lm_haed.shape)  # torch.Size([33, 4096])
    self.lm_head.weight.data[-self.config.num_new_tokens:] = LLaMA_lm_haed
    original_LLaMA_lm_head = self.original_lm_head_value  # [32000, 4096]
    self.lm_head.weight.data[:-self.config.num_new_tokens] = original_LLaMA_lm_head
    print('Matching language model head:', self.lm_head.weight.data[0] == self.original_LLM_language_model_head_0)
    # 2. word2vec: Embedding(32035, 4096)
    LLaMA_word2vec = weights.pop('model.embed_tokens.weight')
    print("embed_tokens:", LLaMA_word2vec.shape)
    LLaMA_word2vec = LLaMA_word2vec[-self.config.num_new_tokens:]
    self.model.embed_tokens.weight.data[-self.config.num_new_tokens:] = LLaMA_word2vec
    original_LLaMA_embed_tokens = self.origin_inp_embedding
    self.model.embed_tokens.weight.data[:-self.config.num_new_tokens] = original_LLaMA_embed_tokens
    print('Matching word embedding:', self.model.embed_tokens.weight.data[0] == self.original_LLM_word_embedding_0)
    # 3. mm_projector
    mm_projector_param = {'weight': weights.pop('mm_projector.weight'), 'bias': weights.pop('mm_projector.bias')}
    self.mm_projector.load_state_dict(mm_projector_param, strict=True)
    # 4. SD_Query and SD_Qformer -> strip the 'sd_qformer.' prefix
    self.sd_query_tokens.data = weights.pop('sd_query_tokens').float()
    sd_qformer_state = {k[len('sd_qformer.'):]: v for k, v in weights.items()}
    print('Loading embeddings for Qformer checkpoint:', self.sd_qformer.load_state_dict(sd_qformer_state, strict=True))
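As an alternative to commenting the lines out, step 3 could be guarded so the mm_projector is only loaded when the stage-1 checkpoint actually contains those keys. This is just my own sketch of the workaround (a drop-in replacement for step 3 inside the function above), not the official fix:

# 3. mm_projector (guarded): only load it if the stage-1 checkpoint has the keys
if 'mm_projector.weight' in weights and 'mm_projector.bias' in weights:
    mm_projector_param = {
        'weight': weights.pop('mm_projector.weight'),
        'bias': weights.pop('mm_projector.bias'),
    }
    self.mm_projector.load_state_dict(mm_projector_param, strict=True)
else:
    print('mm_projector not found in the stage-1 checkpoint, skipping step 3')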
@zjutkarma Thanks! It helped me a lot.
Did you solve the problem?