About the word_emb for cross attention
Thanks for your great work! Since the text length is usually less than 77, why not mask the padding tokens in word_emb when performing cross attention?
Hi, we use [MASK] tokens for generation by iterative decoding and [PAD] tokens to fill up shorter samples. The [PAD] tokens in the CLIP model can be viewed in a similar manner. Since we only use the text tokens as a condition (not for generation), there is no need for [MASK] tokens on the text side.
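As a toy illustration of the two special tokens (not the actual MMM code; the vocabulary indices and shapes here are made up):

```python
import torch

# Made-up ids: motion codes 0..511 plus two special tokens.
MASK_ID, PAD_ID = 512, 513
max_len = 8

# A motion-token sequence of length 5, padded up to max_len with [PAD].
motion = torch.tensor([17, 42, 7, 99, 3])
padded = torch.cat([motion, torch.full((max_len - len(motion),), PAD_ID)])

# During iterative decoding, some *real* positions are replaced by [MASK]
# and predicted by the model; [PAD] positions only fill the batch and are
# never generated.
corrupted = padded.clone()
corrupted[torch.tensor([1, 3])] = MASK_ID
print(corrupted)  # tensor([ 17, 512,   7, 512,   3, 513, 513, 513])
```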
I mean that when performing cross attention between the word embeddings (key and value) and the motion tokens (query), will the [PAD] tokens from CLIP introduce noise into the motion tokens (see the sketch below for the kind of key-padding mask I have in mind)?
Also, compared with the global text condition alone, does additionally using the fine-grained word embeddings bring a performance gain?
Looking forward to your reply.
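For concreteness, the masking I mean is something like the following (a rough sketch with torch.nn.MultiheadAttention and made-up shapes/lengths, not taken from the MMM code):

```python
import torch
import torch.nn as nn

# Hypothetical shapes: 77 CLIP word embeddings as key/value, T motion tokens as query.
B, T, L, D = 2, 49, 77, 512
motion_tok = torch.randn(B, T, D)    # queries (motion tokens)
word_emb   = torch.randn(B, L, D)    # keys/values (CLIP word embeddings)
lengths    = torch.tensor([12, 30])  # true text lengths per sample (illustrative)

# True where the key position is padding, so attention to [PAD] is zeroed out.
pad_mask = torch.arange(L)[None, :] >= lengths[:, None]   # [B, L]

attn = nn.MultiheadAttention(embed_dim=D, num_heads=8, batch_first=True)
out, _ = attn(query=motion_tok, key=word_emb, value=word_emb,
              key_padding_mask=pad_mask)
```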
The model should learn to ignore the [PAD] tokens (as in CLIP itself). For more information: to get the global (sentence) text embedding, CLIP simply applies a linear projection to the local (word) embeddings, taking the feature at the end-of-text token. https://github.com/openai/CLIP/blob/main/clip/model.py#L343-L356
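For reference, the linked code roughly does the following; the same forward pass also exposes the local (word) embeddings used for cross attention (a sketch using the standard openai/CLIP package, not the MMM wrapper itself):

```python
import torch
import clip

device = "cpu"
model, _ = clip.load("ViT-B/32", device=device)
text = clip.tokenize(["a person walks forward and waves"]).to(device)  # [B, 77]

with torch.no_grad():
    x = model.token_embedding(text).type(model.dtype)            # [B, 77, width]
    x = x + model.positional_embedding.type(model.dtype)
    x = model.transformer(x.permute(1, 0, 2)).permute(1, 0, 2)   # NLD <-> LND
    word_emb = model.ln_final(x).type(model.dtype)                # local (word) embeddings
    # Global (sentence) embedding: the feature at the end-of-text token,
    # passed through the linear text_projection.
    sent_emb = word_emb[torch.arange(word_emb.shape[0]),
                        text.argmax(dim=-1)] @ model.text_projection

print(word_emb.shape, sent_emb.shape)  # e.g. [1, 77, 512] and [1, 512] for ViT-B/32
```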
We create a wrapper class here: https://github.com/exitudio/MMM/blob/2f7e3b25234a7fd0de32c7773eb5c39453500d66/train_t2m_trans.py#L76-L80
Applying the local text embeddings shows a trade-off between R-precision and FID. Please see Table 9 in the supplementary material.