About padding_side='left'
Hi, I noticed that you set padding_side='left' in finetune.py. However, in llama the default padding side seems to be 'right'. Would this inconsistency cause problems such as a performance drop?
I think default llama doesn't even use padding tokens.
How can it train without padding? I thought padding is necessary to collate sentences of different lengths into the same batch?
@stellaludai correct, but during pre-training you usually don't use different lengths. See the HF tutorial for more details: https://huggingface.co/course/chapter7/6?fw=pt#preparing-the-dataset
@chrisociepa I see, so you mean that during LLM pretraining the sequences are chunked into equal-length pieces, unlike at inference where each sequence occupies one row of the input batch?
yes, exactly
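To make the chunking concrete, here is a minimal pure-Python sketch of the concatenate-and-chunk approach from the linked HF tutorial (the function name and variables are my own, not the tutorial's exact code):

```python
def group_texts(token_ids_list, block_size):
    """Concatenate tokenized examples and split into fixed-size blocks.

    The final partial block (shorter than block_size) is dropped,
    so every batch element has the same length and no padding is needed.
    """
    concatenated = [t for ids in token_ids_list for t in ids]
    total = (len(concatenated) // block_size) * block_size
    return [concatenated[i:i + block_size] for i in range(0, total, block_size)]

# Three sequences of different lengths, chunked with block_size=4
chunks = group_texts([[1, 2, 3], [4, 5], [6, 7, 8, 9]], block_size=4)
# -> [[1, 2, 3, 4], [5, 6, 7, 8]]; the leftover token 9 is dropped
```

This is why the default llama setup can get away without a pad token: the dataset itself guarantees uniform lengths.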
@chrisociepa @ElleLeonne Thank you very much for sharing!
I think @ElleLeonne just refers to the fact that there is no pad_token in the LlamaTokenizer config, the default value is None.
@chrisociepa Thank you for sharing the link. But I don't think it addresses the same thing as the original question:
The example in the link manually throws away chunks where length != context_length, but alpaca-lora does not do this when creating the dataset. Instead, it pads in the data collator:
data_collator=transformers.DataCollatorForSeq2Seq(
    tokenizer, pad_to_multiple_of=8, return_tensors="pt", padding=True
)
I think padding_side = left/right should not matter, but an attention_mask should be supplied so the model ignores the pad positions, which is not the case in alpaca-lora IMHO.
@Nsigma-Bill Thanks for sharing your opinion, but the answers above actually solved my question. In the link, the last remaining tail is thrown away if it is shorter than the previous chunks, which avoids padding issues. As for the attention mask, the transformers package should already handle that.