Should I provide a true attention mask?
Hello, thank you for releasing the code and weights of LLaDA as open source.
I'm a bit confused about why attn_mask is set to None. When fine-tuning LLaDA with padded input data (specifically left-padded using a padding token), is this setting still appropriate? Or should I instead provide a proper attention mask to account for the padding?
Here's the relevant code snippet:
# Get the attention scores.
# shape: (B, nh, T, hs)
att = self._scaled_dot_product_attention(
    q,
    k,
    v,
    attn_mask=None,
    dropout_p=0.0 if not self.training else self.config.attention_dropout,
    is_causal=False,
)
Code reference: here
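For concreteness, here is a minimal sketch of what I mean by providing a proper mask. It assumes left padding with a known `pad_token_id` and PyTorch's `F.scaled_dot_product_attention`; the names `input_ids` and `pad_token_id` are only illustrative, not taken from the LLaDA code:

```python
import torch
import torch.nn.functional as F

# Hypothetical left-padded batch; 0 stands in for the padding token id.
pad_token_id = 0
input_ids = torch.tensor([
    [0, 0, 11, 12, 13],    # two pad tokens on the left
    [21, 22, 23, 24, 25],  # no padding
])

# Boolean key-padding mask: True = attend, False = masked out.
# Shape (B, T), broadcast to (B, 1, 1, T) against (B, nh, T, T) scores.
key_padding_mask = (input_ids != pad_token_id)
attn_mask = key_padding_mask[:, None, None, :]

B, T = input_ids.shape
nh, hs = 4, 8
q = torch.randn(B, nh, T, hs)
k = torch.randn(B, nh, T, hs)
v = torch.randn(B, nh, T, hs)

# Padded key positions receive zero attention weight.
out = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask, is_causal=False)
```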
Thank you in advance for your help!
Same issue as #89.
Thanks for your interest!
Since we didn't use attention masks in either pre-training or SFT, we simply set it to None for convenience. However, we have to admit that attention masks might be useful in certain scenarios, and I'm considering updating our code.
Hi, I've fixed this bug in this PR, could you test it? @colinzhaoxp @NieShenRuc
I will test it within a few hours.
@Kamichanw thanks for your work!
I see that you added a new argument, attention_mask, to the attention function.
I also have a question about the attention_bias argument in the attention function: is it unused? What role does it play?
I've asked @NieShenRuc: attention_bias does not affect the final output. It may have played a role similar to the attention mask in an early stage of training, but the relevant code was later removed by the author. I think it can be removed safely now.
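For anyone reading along, the distinction is roughly this (a sketch under my own assumptions, not the repo's code): a boolean attention mask marks which key positions may be attended to, while an additive bias is added to the raw attention scores before the softmax. With -inf in the masked positions, the two are equivalent:

```python
import torch
import torch.nn.functional as F

B, nh, T, hs = 1, 2, 4, 8
q, k, v = (torch.randn(B, nh, T, hs) for _ in range(3))

# Boolean mask: True = keep, False = score forced to -inf before softmax.
bool_mask = torch.ones(B, 1, T, T, dtype=torch.bool)
bool_mask[..., -1] = False  # e.g. mask out the last key position

# Equivalent additive bias: 0 where allowed, -inf where masked.
additive_bias = torch.zeros(B, 1, T, T)
additive_bias[..., -1] = float("-inf")

out_mask = F.scaled_dot_product_attention(q, k, v, attn_mask=bool_mask)
out_bias = F.scaled_dot_product_attention(q, k, v, attn_mask=additive_bias)
assert torch.allclose(out_mask, out_bias, atol=1e-6)
```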
Hi @Kamichanw I have tested it, and it works well. Sorry for the delay.