
Attention mask counts padding tokens

I noticed that the attention mask built in the generate_and_tokenize_prompt function looks wrong.

If we set the attention mask like

"attention_mask": [1] * (len(full_tokens))

then the padding tokens are counted when the attention matrix is computed.

Instead, we can use encode_plus to get an attention mask that excludes the padding tokens directly.
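
For example, something along these lines (a minimal sketch, not the repo's exact code; base_model, full_prompt, and cutoff_len are illustrative names, and calling the tokenizer directly is equivalent to encode_plus for a single text):

from transformers import AutoTokenizer

# Assumptions for the sketch: base_model is whatever LLaMA checkpoint is being
# fine-tuned, and pad_token_id is set to something other than the eos token.
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token_id = 0

encoded = tokenizer(
    full_prompt,              # full_prompt: the formatted instruction + response text
    truncation=True,
    max_length=cutoff_len,    # cutoff_len: the max sequence length used for training
    padding="max_length",
)
# encoded["attention_mask"] is 1 for real tokens and 0 for <pad> positions,
# so there is no need to hard-code [1] * len(full_tokens).
input_ids = encoded["input_ids"]
attention_mask = encoded["attention_mask"]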

NormXU avatar Mar 21 '23 07:03 NormXU

Isn't this irrelevant if attention is causal? The padding tokens are at the end.

tloen avatar Mar 21 '23 21:03 tloen

I think the two cases produce different causal masks. Here is an example:

If the attention mask is

1  1  1  1  1  0  0

and the token list is

<s> <t1> <t2> <t3> </s> <pad> <pad>

then the causal attention mask becomes

1  0  0  0  0  0  0
1  1  0  0  0  0  0
1  1  1  0  0  0  0
1  1  1  1  0  0  0
1  1  1  1  1  0  0
1  1  1  1  1  0  0
1  1  1  1  1  0  0

The causal attention mask blocks attention to the padding tokens.

However, if the attention mask is

1  1  1  1  1  1  1

where the last two tokens are padding tokens, then the causal attention mask becomes

1  0  0  0  0  0  0
1  1  0  0  0  0  0
1  1  1  0  0  0  0
1  1  1  1  0  0  0
1  1  1  1  1  0  0
1  1  1  1  1  1  0
1  1  1  1  1  1  1

This attention mask forces the model to learn to decode <pad> even after it has seen </s>.
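
For concreteness, the two matrices above can be reproduced with a small sketch (PyTorch, illustrative values only, not the model's actual masking code):

import torch

seq_len = 7
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.long))

# Case 1: padding positions are masked out, so the <pad> columns are zeroed.
pad_mask = torch.tensor([1, 1, 1, 1, 1, 0, 0])
masked = causal * pad_mask.unsqueeze(0)      # reproduces the first matrix

# Case 2: an all-ones mask leaves the <pad> columns visible, so positions
# after </s> are treated as real tokens during training.
all_ones = torch.ones(seq_len, dtype=torch.long)
unmasked = causal * all_ones.unsqueeze(0)    # reproduces the second matrix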

NormXU avatar Mar 22 '23 02:03 NormXU

Interesting, you're probably right (although the DataCollator below actually makes this entire section of the code a no-op for now). Let me test this.
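
For reference, a collator along these lines (a hedged sketch using transformers.DataCollatorForSeq2Seq; not necessarily the exact arguments used in this repo) pads each batch dynamically and emits an attention mask with zeros at the padded positions:

from transformers import DataCollatorForSeq2Seq

# Sketch: with per-example padding removed, the collator pads each batch to its
# longest example, extends attention_mask with 0s, and pads labels with -100 so
# the loss ignores those positions.
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    padding=True,
    label_pad_token_id=-100,
    return_tensors="pt",
)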

tloen avatar Mar 22 '23 04:03 tloen

Resolved in #146

tloen avatar Mar 24 '23 19:03 tloen