llama-recipes
Should we adjust attention mask when packing data?
🚀 The feature, motivation and pitch
Hello,
I am hoping to get some guidance on a common scenario: when selecting "packing" (instead of padding) as the batching strategy, is it necessary to adjust the attention mask?
The idea is similar to this post and this post. The concern is that when packing multiple examples into one entry, we may need to prevent tokens in one example from attending to other, irrelevant examples.
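To make concrete what "adjusting the attention mask" would mean here, below is a minimal sketch (the example lengths 3 and 2 are made up): a causal mask that is additionally block-diagonal over example boundaries, so tokens can only attend within their own packed example.

```python
import torch

# Two examples of lengths 3 and 2 packed into one sequence of length 5.
seq_ids = torch.tensor([0, 0, 0, 1, 1])                    # which example each token belongs to
causal  = torch.tril(torch.ones(5, 5, dtype=torch.bool))   # standard causal mask
mask    = causal & (seq_ids[None, :] == seq_ids[:, None])  # True = may attend
print(mask.int())
# Tokens 3-4 (second example) can no longer attend to tokens 0-2 (first example).
```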
In Google's Flan paper, it was mentioned that
"We use packing (Raffel et al., 2020) to combine multiple training examples into a single sequence, separating inputs from targets using an end-of-sequence token. Masking is applied to prevent the tokens from attending to others across the packed example boundary".
However, in the LLaMA-2 paper, it was only mentioned that
"To ensure the model sequence length is properly filled, we concatenate all the prompts and answers from the training set. A special token is utilized to separate the prompt and answer segments."
In the vanilla implementation of data concatenation in llama-recipes, attention masking is not implemented. Some sample implementations of attention-mask adjustment can be found here (by Google) and here.
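For reference, the plain concatenate-and-chunk strategy described above boils down to something like the sketch below. This is only an illustration of the idea (the function and argument names are mine), not the actual llama-recipes ConcatDataset code; note that no mask adjustment is made, so each chunk is treated as one ordinary causal-LM sequence.

```python
from itertools import chain

def pack_examples(tokenized_examples, chunk_size, eos_token_id):
    """Concatenate tokenized examples (separated by EOS) into one stream
    and slice it into fixed-length chunks, with no mask adjustment."""
    stream = list(chain.from_iterable(ids + [eos_token_id] for ids in tokenized_examples))
    # Drop the trailing remainder that does not fill a whole chunk.
    n_chunks = len(stream) // chunk_size
    chunks = [stream[i * chunk_size:(i + 1) * chunk_size] for i in range(n_chunks)]
    return [
        {"input_ids": c, "labels": list(c), "attention_mask": [1] * chunk_size}
        for c in chunks
    ]
```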
I am not entirely sure whether this step is necessary, or whether, as long as we separate different examples with special tokens (like EOS), the model will learn to figure it out on its own.
Another likely relevant consideration is that Hugging Face's BetterTransformer (the Fast Kernels option in llama-recipes) does not support attention_mask:
"The PyTorch-native scaled_dot_product_attention operator can only dispatch to Flash Attention if no attention_mask is provided. Thus, by default in training mode, the BetterTransformer integration drops the mask support and can only be used for training that do not require a padding mask for batched training. BetterTransformer is not suited for the fine-tuning of models on tasks that requires a padding mask."
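As a small illustration of that constraint (my own sketch, not BetterTransformer itself, and written against the backend-selection context manager available around PyTorch 2.0/2.1): with an explicit attn_mask, the flash kernel cannot be selected, so forcing it with the other backends disabled fails, whereas the mask-free is_causal path is eligible. This needs a CUDA device and fp16/bf16 inputs, and in my understanding the failure typically surfaces as a RuntimeError.

```python
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim); flash attention wants fp16/bf16 on CUDA.
q = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

# No explicit attn_mask (causality handled via is_causal): flash kernel is eligible.
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False,
                                    enable_mem_efficient=False):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# An explicit attn_mask (e.g. a block-diagonal packing mask) rules out the flash
# kernel; with the other backends disabled, nothing can run and an error is raised.
packing_mask = torch.tril(torch.ones(128, 128, dtype=torch.bool, device="cuda"))
try:
    with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False,
                                        enable_mem_efficient=False):
        F.scaled_dot_product_attention(q, k, v, attn_mask=packing_mask)
except RuntimeError as err:
    print("Flash kernel unavailable with an explicit mask:", err)
```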
Also, note that there have been several discussions about this topic in the Hugging Face community (here, here, and here). I personally agree with @younesbelkada that "adding the EOS token is an enough signal for the model", but I am curious about the take from Meta.
Many thanks.
Alternatives
No response
Additional context
No response
@hanyin88 I agree that adding EOS should be enough, especially if the dataset is not very correlated. We follow a similar practice.
Just came across this topic/problem recently and noticed this implementation, which specifically addresses this issue. They do adjust the attention mask to avoid "cross-contamination" between packed samples. https://github.com/MeetKai/functionary/tree/main/functionary/train/packing#assert-implementation
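For anyone who wants to experiment with that direction, here is a rough sketch (my own hypothetical helper, not the linked functionary code) of the two pieces such an approach pairs together for a packed sequence: an additive [1, 1, T, T] mask that is causal and block-diagonal over example boundaries, plus position_ids that restart at 0 for each packed example. Whether a given model accepts a 4D mask directly depends on the modeling code and library version.

```python
import torch

def packed_mask_and_positions(example_lengths, dtype=torch.float32):
    """Build (attn_bias, position_ids) for one packed sequence:
    attn_bias is a [1, 1, T, T] additive mask (0 = attend, large negative =
    blocked) that is causal within each example and blocks attention across
    example boundaries; position_ids restart at 0 at every boundary."""
    lengths = torch.tensor(example_lengths)
    seq_ids = torch.repeat_interleave(torch.arange(len(example_lengths)), lengths)
    total = int(lengths.sum())

    allowed = torch.tril(torch.ones(total, total, dtype=torch.bool))
    allowed &= seq_ids[None, :] == seq_ids[:, None]

    attn_bias = torch.zeros(total, total, dtype=dtype)
    attn_bias.masked_fill_(~allowed, torch.finfo(dtype).min)

    position_ids = torch.cat([torch.arange(n) for n in example_lengths])
    return attn_bias[None, None], position_ids[None]

attn_bias, position_ids = packed_mask_and_positions([3, 2])
print(attn_bias.shape, position_ids)  # torch.Size([1, 1, 5, 5]) tensor([[0, 1, 2, 0, 1]])
```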