
Question about token slicing strategy in MMM and BAMM

Open Heibaiii opened this issue 6 months ago • 1 comment

Thank you for your excellent work on MMM and BAMM. I have been studying both papers and found them truly insightful and inspiring!

While reading through the implementations, I noticed that in MMM the token sequence output is sliced as [1:], whereas in BAMM it is sliced as [:-1]. Since both methods use a Transformer to process the sequence, I was wondering if you could kindly explain the reasoning or intuition behind this difference in slicing strategy.

The slicing in MMM is in train_t2m_trans.py, line 213: "cls_pred = trans_encoder(masked_input_indices, feat_clip_text, src_mask = seq_mask, att_txt=att_txt, word_emb=word_emb)[:, 1:]"

The slicing in BAMM is in transformer.py, line 362: "logits = self.trans_forward(x_ids, cond_vector, None, force_mask, cond_idx=cond_idx)[..., :-1]"

I noticed that the code architectures of MMM and BAMM are quite different, so maybe this is just a trivial question :)

Thank you very much for your time and your excellent work!

Best regards! @exitudio

Heibaiii avatar Jun 10 '25 03:06 Heibaiii

Thank you for your interest in my papers.

In both MMM and BAMM, the text token is prepended to the beginning of the input sequence. BAMM is an autoregressive model, so each output position predicts the next token: the first motion token is predicted at the text-token position, the second at the first motion-token position, and so on. The output at the last position would predict a token beyond the end of the sequence, so that last prediction is not used, which is why the output is sliced with [..., :-1].
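
As a rough illustration (this is not the actual BAMM code; the shapes, names, and axis layout are made up for the sketch, and the real code may put the time axis elsewhere), the next-token shift looks like this:

```python
import torch

T = 4                                  # number of motion tokens (illustrative)
V = 512                                # codebook / vocab size (illustrative)
# Input order: [text_token, m1, ..., mT]; one logit vector per input position.
logits = torch.randn(1, 1 + T, V)

# Autoregressive: the output at position i predicts the token at position i+1,
# so the text-token position already predicts m1, and the output at the last
# motion-token position would predict a token past the end of the sequence.
pred = logits[:, :-1]                  # drop the unused last prediction
assert pred.shape[1] == T              # now aligned with the targets m1..mT
```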

However, in MMM (and masked models in general), the input and output positions are aligned: the output at each position is the prediction for the token at that same position. The text token at the first position never needs to be predicted, so the prediction at the text-token position is simply discarded, which is why the output is sliced with [:, 1:].
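
And a similar sketch for the masked case (again hypothetical, not the MMM code; shapes and names are only for illustration):

```python
import torch

T = 4                                  # number of motion tokens (illustrative)
V = 512                                # codebook / vocab size (illustrative)
# Input order: [text_token, m1, ..., mT]; one logit vector per input position.
logits = torch.randn(1, 1 + T, V)

# Masked modeling: the output at position i is the prediction for the token
# at that same position. The text token at position 0 is never masked, so its
# prediction is never needed and is simply dropped.
cls_pred = logits[:, 1:]               # keep only the motion-token positions
assert cls_pred.shape[1] == T
```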

exitudio avatar Jun 11 '25 02:06 exitudio