Why double the max sequence length while precomputing the frequency for rotary embedding?
https://github.com/facebookresearch/llama/blob/57b0eb62de0636e75af471e49e2f1862d908d9d8/llama/model.py#L219
Can anyone explain why the sequence length is doubled here?
To the best of my knowledge, we don't have to constrain the maximum sequence length when using rotary embeddings, because there are no learnable parameters that depend on the sequence length. Empirically, though, the model does not work well on sequences longer than the ones it was trained on.
Does max_seq_len in the configuration mean that LLaMA was trained on sequences of at most 2048 tokens? If so, what is max_seq_len * 2 for? Is it just an implementation trick for RoPE?
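For context, the linked line feeds max_seq_len * 2 into a function that builds the rotary table. A minimal sketch of what that precomputation does (paraphrased, not the exact upstream code; the head dimension of 128 is an assumption for the 7B config with dim 4096 and 32 heads):

```python
import torch

def precompute_freqs_cis(dim: int, end: int, theta: float = 10000.0):
    # Inverse frequencies for each pair of dimensions, as in the RoPE formulation.
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
    # Positions 0 .. end-1; `end` is where max_seq_len * 2 comes in on the linked line.
    t = torch.arange(end, device=freqs.device).float()
    # One rotation angle per (position, frequency) pair.
    angles = torch.outer(t, freqs)
    # Complex exponentials e^{i * angle}, later used to rotate query/key pairs.
    return torch.polar(torch.ones_like(angles), angles)

# Hypothetical usage mirroring the linked call site:
freqs_cis = precompute_freqs_cis(dim=128, end=2048 * 2)
```

Nothing here learns anything; it is just a lookup table of rotations, which is why I suspect the * 2 is only extra headroom rather than something the model depends on.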
Can we increase the context length, say up to 4k, by fine-tuning?
I think there is no definitive answer here. Many things in deep learning are empirical. You can try it, but I can't guarantee it will work.
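If you want to experiment, the precomputed table itself is not a hard limit; you could, for example, rebuild it for a 4096-token context before fine-tuning, using the same sketch function as above (whether quality holds up beyond the trained length is a separate, empirical question):

```python
# Hypothetical: rebuild the rotary table for a 4096-token context before fine-tuning.
# This only enlarges the positional lookup table; it does not by itself make the
# model handle longer sequences well.
longer_freqs_cis = precompute_freqs_cis(dim=128, end=4096 * 2)
```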