Should I provide a true attention mask?
Hello, thank you for releasing the code and weights of LLaDA as open source.
I'm a bit confused about why attn_mask is set to None. When fine-tuning LLaDA with padded input data (specifically left-padded using a padding token), is this setting still appropriate? Or should I instead provide a proper attention mask to account for the padding?
Here's the relevant code snippet:
# Get the attention scores.
# shape: (B, nh, T, hs)
att = self._scaled_dot_product_attention(
    q,
    k,
    v,
    attn_mask=None,
    dropout_p=0.0 if not self.training else self.config.attention_dropout,
    is_causal=False,
)
Code reference: here
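For concreteness, here is a minimal sketch of what I mean by providing a proper mask. It assumes left padding with a known `pad_token_id` and PyTorch's `F.scaled_dot_product_attention`; the names `input_ids` and `pad_token_id` are only illustrative, not taken from the LLaDA code:

```python
import torch
import torch.nn.functional as F

# Hypothetical left-padded batch; 0 stands in for the padding token id.
pad_token_id = 0
input_ids = torch.tensor([
    [0, 0, 11, 12, 13],    # two pad tokens on the left
    [21, 22, 23, 24, 25],  # no padding
])

# Boolean key-padding mask: True = attend, False = masked out.
# Shape (B, T), broadcast to (B, 1, 1, T) against (B, nh, T, T) scores.
key_padding_mask = (input_ids != pad_token_id)
attn_mask = key_padding_mask[:, None, None, :]

B, T = input_ids.shape
nh, hs = 4, 8
q = torch.randn(B, nh, T, hs)
k = torch.randn(B, nh, T, hs)
v = torch.randn(B, nh, T, hs)

# Padded key positions receive zero attention weight.
out = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask, is_causal=False)
```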
Thank you in advance for your help!
Same issue as #89.
Thanks for your interest!
Since we didn't use attention masks in either pre-training or SFT, we simply set it to None for convenience. However, we have to admit that attention masks might be useful in certain scenarios, and I'm considering updating our code.
Hi, I've fixed this bug in this PR, could you test it? @colinzhaoxp @NieShenRuc
I will test it within a few hours.
@Kamichanw thanks for your work!
I see that you added a new argument, attention_mask, to the attention function.
I also have a question about the attention_bias argument in the attention function: is it unused? What role does it play?
I've asked @NieShenRuc: attention_bias does not affect the final output. It may have played a role similar to the attention mask in an early stage of training, but the relevant code was later removed by the author. I think it can be removed safely now.
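For anyone reading along, the distinction is roughly this (a sketch under my own assumptions, not the repo's code): a boolean attention mask marks which key positions may be attended to, while an additive bias is added to the raw attention scores before the softmax. With -inf in the masked positions, the two are equivalent:

```python
import torch
import torch.nn.functional as F

B, nh, T, hs = 1, 2, 4, 8
q, k, v = (torch.randn(B, nh, T, hs) for _ in range(3))

# Boolean mask: True = keep, False = score forced to -inf before softmax.
bool_mask = torch.ones(B, 1, T, T, dtype=torch.bool)
bool_mask[..., -1] = False  # e.g. mask out the last key position

# Equivalent additive bias: 0 where allowed, -inf where masked.
additive_bias = torch.zeros(B, 1, T, T)
additive_bias[..., -1] = float("-inf")

out_mask = F.scaled_dot_product_attention(q, k, v, attn_mask=bool_mask)
out_bias = F.scaled_dot_product_attention(q, k, v, attn_mask=additive_bias)
assert torch.allclose(out_mask, out_bias, atol=1e-6)
```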
Hi @Kamichanw I have tested it, and it works well. Sorry for the delay.