About the word_emb for cross attention
Thanks for your great work! Since the text length is usually less than 77, why not mask the padding tokens in word_emb when performing cross attention?
Hi, we use [MASK] tokens for generation by iterative decoding and [PAD] tokens to fill up shorter samples. The [PAD] tokens in the CLIP model can be viewed in a similar manner. Since we only use the text tokens as a condition (not for generation), there is no need for [MASK] tokens on the text side.
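As a toy illustration of the two special tokens (not the actual MMM code; the vocabulary indices and shapes here are made up):

```python
import torch

# Made-up ids: motion codes 0..511 plus two special tokens.
MASK_ID, PAD_ID = 512, 513
max_len = 8

# A motion-token sequence of length 5, padded up to max_len with [PAD].
motion = torch.tensor([17, 42, 7, 99, 3])
padded = torch.cat([motion, torch.full((max_len - len(motion),), PAD_ID)])

# During iterative decoding, some *real* positions are replaced by [MASK]
# and predicted by the model; [PAD] positions only fill the batch and are
# never generated.
corrupted = padded.clone()
corrupted[torch.tensor([1, 3])] = MASK_ID
print(corrupted)  # tensor([ 17, 512,   7, 512,   3, 513, 513, 513])
```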
I mean that when performing cross attention between the word embeddings (key and value) and the motion tokens (query), will the [PAD] tokens from CLIP introduce noise into the motion tokens (see the sketch below for the kind of key-padding mask I have in mind)?
Also, compared with the global text condition alone, does additionally using the fine-grained word embeddings bring a performance gain?
Looking forward to your reply.
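For concreteness, the masking I mean is something like the following (a rough sketch with torch.nn.MultiheadAttention and made-up shapes/lengths, not taken from the MMM code):

```python
import torch
import torch.nn as nn

# Hypothetical shapes: 77 CLIP word embeddings as key/value, T motion tokens as query.
B, T, L, D = 2, 49, 77, 512
motion_tok = torch.randn(B, T, D)    # queries (motion tokens)
word_emb   = torch.randn(B, L, D)    # keys/values (CLIP word embeddings)
lengths    = torch.tensor([12, 30])  # true text lengths per sample (illustrative)

# True where the key position is padding, so attention to [PAD] is zeroed out.
pad_mask = torch.arange(L)[None, :] >= lengths[:, None]   # [B, L]

attn = nn.MultiheadAttention(embed_dim=D, num_heads=8, batch_first=True)
out, _ = attn(query=motion_tok, key=word_emb, value=word_emb,
              key_padding_mask=pad_mask)
```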
The model should learn to ignore the [PAD] tokens (as in CLIP itself). For more information: to get the global (sentence) text embedding, CLIP simply applies a linear projection to the local (word) embeddings, taking the feature at the end-of-text token. https://github.com/openai/CLIP/blob/main/clip/model.py#L343-L356
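For reference, the linked code roughly does the following; the same forward pass also exposes the local (word) embeddings used for cross attention (a sketch using the standard openai/CLIP package, not the MMM wrapper itself):

```python
import torch
import clip

device = "cpu"
model, _ = clip.load("ViT-B/32", device=device)
text = clip.tokenize(["a person walks forward and waves"]).to(device)  # [B, 77]

with torch.no_grad():
    x = model.token_embedding(text).type(model.dtype)            # [B, 77, width]
    x = x + model.positional_embedding.type(model.dtype)
    x = model.transformer(x.permute(1, 0, 2)).permute(1, 0, 2)   # NLD <-> LND
    word_emb = model.ln_final(x).type(model.dtype)                # local (word) embeddings
    # Global (sentence) embedding: the feature at the end-of-text token,
    # passed through the linear text_projection.
    sent_emb = word_emb[torch.arange(word_emb.shape[0]),
                        text.argmax(dim=-1)] @ model.text_projection

print(word_emb.shape, sent_emb.shape)  # e.g. [1, 77, 512] and [1, 512] for ViT-B/32
```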
We create a wrapper class here: https://github.com/exitudio/MMM/blob/2f7e3b25234a7fd0de32c7773eb5c39453500d66/train_t2m_trans.py#L76-L80
Applying the local text embeddings shows a trade-off between R-precision and FID. Please see Table 9 in the supplementary material.