alpaca-lora
Attention mask counts padding tokens
I noticed that the attention mask in the function `generate_and_tokenize_prompt` is weird.
If we set the attention mask like
"attention_mask": [1] * (len(full_tokens))
then the padding tokens will be counted when calculating the attention matrix.
Therefore, we can use `encode_plus` to directly get an attention mask that excludes the padding tokens.
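Here is a minimal sketch of the difference. The GPT-2 tokenizer, the prompt string, and the `max_length` below are made up purely for illustration (the repo itself uses the LLaMA tokenizer):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

prompt = "### Instruction:\nSay hi.\n\n### Response:\nHi!"

# Letting the tokenizer pad (encode_plus or a plain tokenizer(...) call both work):
# its attention_mask is 0 on the padding positions.
enc = tokenizer(prompt, padding="max_length", max_length=32, truncation=True)
print(enc["attention_mask"])   # [1, 1, ..., 1, 0, 0, ..., 0]  <- pads masked out

# The pattern in question: a hand-built mask of all 1s over the padded length,
# which tells the model to attend to the <pad> positions as well.
manual_mask = [1] * len(enc["input_ids"])
print(manual_mask)             # [1, 1, ..., 1, 1, 1, ..., 1]  <- pads counted
```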
Isn't this irrelevant if attention is causal? The padding tokens are at the end.
I think they are different in the causal masks. Here is an example:
If the attention mask is
1 1 1 1 1 0 0
and the token list is
<s> <t1> <t2> <t3> </s> <pad> <pad>
then we can calculate the causal attention mask like
1 0 0 0 0 0 0
1 1 0 0 0 0 0
1 1 1 0 0 0 0
1 1 1 1 0 0 0
1 1 1 1 1 0 0
1 1 1 1 1 0 0
1 1 1 1 1 0 0
The causal attention mask stops before the padding tokens.
However, if the attention mask is
1 1 1 1 1 1 1
where the last two tokens are padding tokens, then we can calculate a causal attention mask like
1 0 0 0 0 0 0
1 1 0 0 0 0 0
1 1 1 0 0 0 0
1 1 1 1 0 0 0
1 1 1 1 1 0 0
1 1 1 1 1 1 0
1 1 1 1 1 1 1
This attention mask will force the model to learn to decode <pad> even after seeing </s>.
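For concreteness, here is a small PyTorch sketch (7 positions, 2 pads) of how the 1-D mask combines with the causal mask and reproduces the two matrices above. The real implementation applies the mask as an additive -inf bias inside attention, but the effective pattern is the same:

```python
import torch

seq_len = 7
# Plain causal mask: row i may attend to positions j <= i.
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.long))

# Case 1: padding positions zeroed out -> no row ever attends to <pad>.
pad_aware = torch.tensor([1, 1, 1, 1, 1, 0, 0])
print(causal * pad_aware.unsqueeze(0))

# Case 2: all-ones mask -> the last rows attend to (and are trained on) <pad>.
all_ones = torch.tensor([1, 1, 1, 1, 1, 1, 1])
print(causal * all_ones.unsqueeze(0))
```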
Interesting, you're probably right (although the DataCollator below actually makes this entire section of the code a no-op for now). Let me test this.
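For reference, a rough sketch of why a dynamic-padding collator can make the hand-built mask moot: if examples are tokenized without padding, the collator adds the pad tokens at batch time and extends `attention_mask` with zeros on its own. `DataCollatorForSeq2Seq` and the GPT-2 tokenizer are assumptions here for illustration, not necessarily the exact setup in the repo:

```python
from transformers import AutoTokenizer, DataCollatorForSeq2Seq

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# No padding at tokenize time; each example keeps its natural length.
features = [
    tokenizer("Short prompt"),
    tokenizer("A somewhat longer prompt goes here"),
]

collator = DataCollatorForSeq2Seq(tokenizer, padding=True, return_tensors="pt")
batch = collator(features)
print(batch["attention_mask"])  # 0s appear only where the collator padded
```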
Resolved in #146