Why double the max sequence length while precomputing the frequency for rotary embedding?
https://github.com/facebookresearch/llama/blob/57b0eb62de0636e75af471e49e2f1862d908d9d8/llama/model.py#L219
Can anyone explain why the sequence length is doubled here?
To the best of my knowledge, we don't have to constrain the maximum sequence length when using rotary embeddings, because there are no learnable parameters that depend on the sequence length. Empirically, though, the model does not work well on sequences longer than the ones it was trained on.
Does max_seq_len in the configuration mean that LLaMA was trained on sequences of at most 2048 tokens? If so, what is max_seq_len * 2 for? Is it just an implementation trick for RoPE?
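For context, the linked line feeds max_seq_len * 2 into a function that builds the rotary table. A minimal sketch of what that precomputation does (paraphrased, not the exact upstream code; the head dimension of 128 is an assumption for the 7B config with dim 4096 and 32 heads):

```python
import torch

def precompute_freqs_cis(dim: int, end: int, theta: float = 10000.0):
    # Inverse frequencies for each pair of dimensions, as in the RoPE formulation.
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
    # Positions 0 .. end-1; `end` is where max_seq_len * 2 comes in on the linked line.
    t = torch.arange(end, device=freqs.device).float()
    # One rotation angle per (position, frequency) pair.
    angles = torch.outer(t, freqs)
    # Complex exponentials e^{i * angle}, later used to rotate query/key pairs.
    return torch.polar(torch.ones_like(angles), angles)

# Hypothetical usage mirroring the linked call site:
freqs_cis = precompute_freqs_cis(dim=128, end=2048 * 2)
```

Nothing here learns anything; it is just a lookup table of rotations, which is why I suspect the * 2 is only extra headroom rather than something the model depends on.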
Can we increase the context length, say up to 4k, by fine-tuning?
I think there is no definitive answer here. Many things in deep learning are empirical. You can try it, but I can't guarantee it will work.
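If you want to experiment, the precomputed table itself is not a hard limit; you could, for example, rebuild it for a 4096-token context before fine-tuning, using the same sketch function as above (whether quality holds up beyond the trained length is a separate, empirical question):

```python
# Hypothetical: rebuild the rotary table for a 4096-token context before fine-tuning.
# This only enlarges the positional lookup table; it does not by itself make the
# model handle longer sequences well.
longer_freqs_cis = precompute_freqs_cis(dim=128, end=4096 * 2)
```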