alpaca-lora
Attention mask counts padding tokens
I noticed that the attention mask in the function `generate_and_tokenize_prompt` is weird.
If we set the attention mask like
"attention_mask": [1] * (len(full_tokens))
then the padding tokens will be counted when calculating the attention matrix.
Therefore, we can use `encode_plus` to directly get an attention mask that excludes the padding tokens.
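Here is a minimal sketch of the difference. The GPT-2 tokenizer, the prompt string, and the `max_length` below are made up purely for illustration (the repo itself uses the LLaMA tokenizer):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

prompt = "### Instruction:\nSay hi.\n\n### Response:\nHi!"

# Letting the tokenizer pad (encode_plus or a plain tokenizer(...) call both work):
# its attention_mask is 0 on the padding positions.
enc = tokenizer(prompt, padding="max_length", max_length=32, truncation=True)
print(enc["attention_mask"])   # [1, 1, ..., 1, 0, 0, ..., 0]  <- pads masked out

# The pattern in question: a hand-built mask of all 1s over the padded length,
# which tells the model to attend to the <pad> positions as well.
manual_mask = [1] * len(enc["input_ids"])
print(manual_mask)             # [1, 1, ..., 1, 1, 1, ..., 1]  <- pads counted
```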
Isn't this irrelevant if attention is causal? The padding tokens are at the end.
I think they are different in the causal masks. Here is an example:
If the attention mask is
1 1 1 1 1 0 0
and the token list is
<s> <t1> <t2> <t3> </s> <pad> <pad>
then we can calculate the causal attention mask like
1 0 0 0 0 0 0
1 1 0 0 0 0 0
1 1 1 0 0 0 0
1 1 1 1 0 0 0
1 1 1 1 1 0 0
1 1 1 1 1 0 0
1 1 1 1 1 0 0
The causal attention mask stops before the padding tokens.
However, if the attention mask is
1 1 1 1 1 1 1
where the last two tokens are padding tokens, then we can calculate a causal attention mask like
1 0 0 0 0 0 0
1 1 0 0 0 0 0
1 1 1 0 0 0 0
1 1 1 1 0 0 0
1 1 1 1 1 0 0
1 1 1 1 1 1 0
1 1 1 1 1 1 1
This attention mask will force the model to learn to decode <pad> even after seeing </s>.
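For concreteness, here is a small PyTorch sketch (7 positions, 2 pads) of how the 1-D mask combines with the causal mask and reproduces the two matrices above. The real implementation applies the mask as an additive -inf bias inside attention, but the effective pattern is the same:

```python
import torch

seq_len = 7
# Plain causal mask: row i may attend to positions j <= i.
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.long))

# Case 1: padding positions zeroed out -> no row ever attends to <pad>.
pad_aware = torch.tensor([1, 1, 1, 1, 1, 0, 0])
print(causal * pad_aware.unsqueeze(0))

# Case 2: all-ones mask -> the last rows attend to (and are trained on) <pad>.
all_ones = torch.tensor([1, 1, 1, 1, 1, 1, 1])
print(causal * all_ones.unsqueeze(0))
```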
Interesting, you're probably right (although the DataCollator below actually makes this entire section of the code a no-op for now). Let me test this.
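For reference, a rough sketch of why a dynamic-padding collator can make the hand-built mask moot: if examples are tokenized without padding, the collator adds the pad tokens at batch time and extends `attention_mask` with zeros on its own. `DataCollatorForSeq2Seq` and the GPT-2 tokenizer are assumptions here for illustration, not necessarily the exact setup in the repo:

```python
from transformers import AutoTokenizer, DataCollatorForSeq2Seq

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# No padding at tokenize time; each example keeps its natural length.
features = [
    tokenizer("Short prompt"),
    tokenizer("A somewhat longer prompt goes here"),
]

collator = DataCollatorForSeq2Seq(tokenizer, padding=True, return_tensors="pt")
batch = collator(features)
print(batch["attention_mask"])  # 0s appear only where the collator padded
```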
Resolved in #146