
Qformer mm_projector issue

Open · zjutkarma opened this issue 1 year ago

Hello, thanks for your amazing work! I have a problem when running this code; could you help me solve it?

When I ran the training script DS_MLLMSD11_train.py, I encountered this error:

  File "/SmartEdit/model/DS_MLLMSD11_model.py", line 243, in load_pretrain_MLLM_alignment
    mm_projector_param = {'weight': weights.pop('mm_projector.weight'), 'bias': weights.pop('mm_projector.bias')}
KeyError: 'mm_projector.weight'

The path I set for SD_QFormer_conversation_33tokens is: /SmartEdit/checkpoints/stage1_CC12M_alignment_7b/embeddings_qformer/checkpoint-150000.bin

In addition, I was able to run the stage 1 inference code successfully.

The QFormer model trained in the first stage is a 6-block BERT-based model. I printed the keys of the model weight dict, and it seems the checkpoint does not contain any "mm_projector" entries.
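
For reference, the checkpoint keys can be inspected directly like this (a minimal sketch; the path is just the one from my setup above):

    import torch

    # Load the Stage-1 checkpoint on CPU and list its keys.
    ckpt_path = "/SmartEdit/checkpoints/stage1_CC12M_alignment_7b/embeddings_qformer/checkpoint-150000.bin"
    weights = torch.load(ckpt_path, map_location="cpu")

    print(sorted(weights.keys()))
    print([k for k in weights if "mm_projector" in k])  # prints [] here, i.e. no mm_projector entries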

There is another thing I'm confused about: I think the "mm_projector" module only exists in the LLaVA model, where its function is to project the image embeddings (from the ViT) from the image latent space into the text latent space. I have no idea why the QFormer module would need an mm_projector; these seem like two completely different things.
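
For context, in LLaVA-style models the mm_projector is usually just a linear layer mapping ViT patch features into the LLM hidden size. A rough illustration, with assumed dimensions (1024 for the vision features, 4096 for the 7B LLM):

    import torch
    import torch.nn as nn

    # Assumed dimensions for illustration: CLIP ViT-L/14 features (1024-d) -> LLaMA-7B hidden size (4096).
    vision_hidden_size, llm_hidden_size = 1024, 4096

    # The mm_projector maps per-patch image features into the LLM's text embedding space.
    mm_projector = nn.Linear(vision_hidden_size, llm_hidden_size, bias=True)

    image_features = torch.randn(1, 256, vision_hidden_size)  # [batch, num_patches, vision_dim]
    projected = mm_projector(image_features)                  # [batch, num_patches, llm_dim]
    print(projected.shape)                                    # torch.Size([1, 256, 4096])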

zjutkarma avatar Sep 24 '24 11:09 zjutkarma

Thanks for your interest in our work. You might be right; this may be a small error introduced when we pushed the code to GitHub, and it runs well after that modification.

yuzhou914 avatar Sep 28 '24 05:09 yuzhou914

Thanks very much for your reply!! Looking forward to the updated version!

There is another question about this function: I can't align the LLaMA output and the QFormer input in the code. weights.pop('lm_head.weight') is a [33, 4096] tensor, while self.config.num_new_tokens is 35 (32 + 2 + 1 in the previous setting).

        # 1. vec2word: Linear(in_features=4096, out_features=32035, bias=False)
        LLaMA_lm_head = weights.pop('lm_head.weight')
        LLaMA_lm_head = LLaMA_lm_head[-self.config.num_new_tokens:]
        self.lm_head.weight.data[-self.config.num_new_tokens:] = LLaMA_lm_head

Is there anything wrong with my configuration? I set up the configuration file following the original code.
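
For reference, a quick shape check before loading makes the mismatch visible (a minimal sketch; the path and the hard-coded 35 are just my settings):

    import torch

    ckpt_path = "/SmartEdit/checkpoints/stage1_CC12M_alignment_7b/embeddings_qformer/checkpoint-150000.bin"
    weights = torch.load(ckpt_path, map_location="cpu")

    saved_rows = weights["lm_head.weight"].shape[0]  # 33 in my checkpoint
    num_new_tokens = 35                              # value from my MLLMSD config (32 + 2 + 1)
    if saved_rows != num_new_tokens:
        print(f"Mismatch: checkpoint has {saved_rows} new-token rows, config.num_new_tokens is {num_new_tokens}")
        # Slicing a [33, 4096] tensor with [-35:] silently keeps all 33 rows, so the later
        # assignment into lm_head.weight.data[-35:] fails with a shape error.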

zjutkarma avatar Sep 28 '24 06:09 zjutkarma

@zjutkarma Hi, I'm having both issues you mentioned:

  1. The mm_projector KeyError when loading QFormer weights
  2. The token number mismatch between LLaMA lm_head (33) and config.num_new_tokens (35)

Could you share:

  1. How did you resolve the mm_projector issue?
  2. How did you handle the token number mismatch? Did you:
    • Modify num_new_tokens in MLLMSD config to match Stage1 (33)?
    • Or adjust Stage1 training to use 35 tokens?
    • Or find another solution?

Really appreciate your help! Thanks!

XuwuChen443 avatar Jan 16 '25 06:01 XuwuChen443

@XuwuChen443 Hi, yes, it's a little bit tricky because it involves multiple components. Here's my experience.

  1. I checked my code: I commented out the mm_projector lines and it works. This function doesn't need to use the mm_projector (see the sketch after the function below for one way to skip it).
        #print('mm_projector weight:', weights['mm_projector.weight'] == LLaVA_00002_weights['model.mm_projector.weight'])
        #print('mm_projector bias:', weights['mm_projector.bias'] == LLaVA_00002_weights['model.mm_projector.bias'])
  2. For the second question, it's a little hard to remember, but I can show you the code of the function I used. I think it is probably the latter solution you mentioned. Hope it works for you. :)
def load_pretrain_MLLM_alignment(self, SD_QFormer_conversation_33tokens, LLaVA_00002):
        # Load the Stage-1 QFormer checkpoint and the LLaVA weights on CPU.
        weights = torch.load(SD_QFormer_conversation_33tokens, map_location="cpu")
        print("q_former weight keys:", list(weights.keys()))
        LLaVA_00002_weights = torch.load(LLaVA_00002, map_location="cpu")

        # 1. vec2word: Linear(in_features=4096, out_features=32035, bias=False)
        #    The checkpoint stores only the rows for the new tokens.
        LLaMA_lm_head = weights.pop('lm_head.weight')
        print("llama lm_head (new-token rows) shape:", LLaMA_lm_head.shape)  # torch.Size([33, 4096])
        LLaMA_lm_head = LLaMA_lm_head[-self.config.num_new_tokens:]
        print("num_new_tokens:", self.config.num_new_tokens)

        # Copy the new-token rows into the tail of lm_head and restore the original rows.
        self.lm_head.weight.data[-self.config.num_new_tokens:] = LLaMA_lm_head
        original_LLaMA_lm_head = self.original_lm_head_value  # [32000, 4096]
        self.lm_head.weight.data[:-self.config.num_new_tokens] = original_LLaMA_lm_head
        print('Matching language model head:', self.lm_head.weight.data[0] == self.original_LLM_language_model_head_0)

        # 2. word2vec: Embedding(32035, 4096)
        LLaMA_word2vec = weights.pop('model.embed_tokens.weight')
        print("embed_tokens shape:", LLaMA_word2vec.shape)
        LLaMA_word2vec = LLaMA_word2vec[-self.config.num_new_tokens:]
        self.model.embed_tokens.weight.data[-self.config.num_new_tokens:] = LLaMA_word2vec
        original_LLaMA_embed_tokens = self.origin_inp_embedding
        self.model.embed_tokens.weight.data[:-self.config.num_new_tokens] = original_LLaMA_embed_tokens
        print('Matching word embedding:', self.model.embed_tokens.weight.data[0] == self.original_LLM_word_embedding_0)

        # 3. mm_projector
        mm_projector_param = {'weight': weights.pop('mm_projector.weight'), 'bias': weights.pop('mm_projector.bias')}
        self.mm_projector.load_state_dict(mm_projector_param, strict=True)

        # 4. SD_Query and SD_Qformer -> strip the 'sd_qformer.' prefix from the remaining keys
        self.sd_query_tokens.data = weights.pop('sd_query_tokens').float()
        sd_qformer_state = {k[len('sd_qformer.'):]: v for k, v in weights.items()}
        print('Loading embeddings for Qformer checkpoint:', self.sd_qformer.load_state_dict(sd_qformer_state, strict=True))
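
And just to make explicit what I meant by commenting out the mm_projector part: a tolerant variant of step 3 (only a sketch, not an official fix) could skip the projector when the Stage-1 checkpoint doesn't contain it:

        # 3. mm_projector (load only if the Stage-1 checkpoint actually saved it)
        if 'mm_projector.weight' in weights and 'mm_projector.bias' in weights:
            mm_projector_param = {'weight': weights.pop('mm_projector.weight'),
                                  'bias': weights.pop('mm_projector.bias')}
            self.mm_projector.load_state_dict(mm_projector_param, strict=True)
        else:
            print('No mm_projector in the Stage-1 checkpoint; keeping its current initialization.')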

zjutkarma avatar Jan 17 '25 08:01 zjutkarma

@zjutkarma Thanks! It helps me a lot.

XuwuChen443 avatar Jan 20 '25 04:01 XuwuChen443

Did you solve the problem?

baihuple avatar Jun 06 '25 16:06 baihuple