LongLM
Question about equation 4 and Table 5 caption in paper
Hi! I have a question that may seem simple, but I think I'm overlooking something.
Assume Phi-2's context window is 2K. When we apply a group size ($G_s$) of 4 and neighbor tokens ($w_n$) of 512, according to Equation 4: $(L - w_n) \times G_s + w_n$, the calculated extended pre-training length is approximately $(2K - 0.5K) \times 4 + 0.5K = 6.5K$.
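As a sanity check on that arithmetic, here is a small helper computing Equation 4 (the function name and exact token counts are my own illustration, not from the repo):

```python
# Equation 4 from the SelfExtend paper: the maximum extended context
# length is (L - w_n) * G_s + w_n, where L is the pretrained context
# window, G_s the group size, and w_n the neighbor-window size.

def max_extended_length(L: int, G_s: int, w_n: int) -> int:
    """Theoretical maximum context length after SelfExtend (Eq. 4)."""
    return (L - w_n) * G_s + w_n

# Phi-2 with a 2K (2048-token) window, G_s=4, w_n=512:
print(max_extended_length(2048, 4, 512))  # 6656, i.e. roughly 6.5K
```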
However, in the paper, the caption of Table 5 states:
The vanilla Phi-2 has a 2k context window. SelfExtend extends Phi-2 to 4K ($G_s=4, w_n=512$), 6K ($G_s=8, w_n=512$).
Considering Phi-2's maximum token length is 2K, I am puzzled as to how the extended lengths of 4K and 6K are derived. What am I missing here? Any insights would be greatly appreciated.
Hi! We have some empirical results about this. You may check out this link: https://github.com/datamllab/LongLM?tab=readme-ov-file#3how-to-choose-the-group_size-and-neighbor_window. Hope this may help!
Thank you very much for the prompt response.
I believe focusing on the position indices that the pre-trained model fails to handle effectively is an excellent observation. Taking these characteristics into account, the sensitivity of LongLM's hyperparameters does indeed seem mild, as the authors claim.
However, I still have some reservations about the inequality in Equation 4. If we allow the pre-trained model's max length (L) to range from 1/2 to 3/2 of its nominal value, shouldn't the hyperparameters in the Table 5 caption be adjusted accordingly, for instance by using a smaller neighbor window or a larger group size?
I mean, $L = 2K, G_s=4, w_n=512$ gives 6.5K, not 4K as stated in the Table 5 caption. I would appreciate any additional clarification you could provide. (By the way, this method is really solid and performant.)
Thanks in advance :)
If you are asking why we use this setting for 4K: actually, we selected the two parameters somewhat arbitrarily, as long as they worked well, and we never considered whether the maximum extended length (6.5K, as you said) equals the target length.
The maximum extended length is just a theoretical bound, which cannot guarantee the best performance. It only ensures the model can still generate something coherent (i.e., low PPL) rather than random output. To obtain good performance, you'd better follow our latest empirical study results (the link above).
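To see where that theoretical bound comes from, the grouped-position idea can be sketched as follows. This is my own simplification for illustration, not the repo's implementation: tokens inside the neighbor window keep their exact relative distance, while distant tokens use floor-divided group positions shifted to start just past the neighbor window.

```python
# Sketch of the SelfExtend-style relative-position mapping (assumed
# simplification, not the repo's code). With this mapping, the largest
# relative position at context length (L - w_n) * G_s + w_n stays
# within the pretrained range, which is where Equation 4 comes from.

def relative_position(q_idx: int, k_idx: int, G_s: int, w_n: int) -> int:
    """Relative position seen by the attention for key k_idx at query q_idx."""
    dist = q_idx - k_idx
    if dist <= w_n:
        # Neighbor window: keep the exact relative distance.
        return dist
    # Grouped region: floor-divide both positions by the group size,
    # then shift so grouped distances begin right after the window.
    shift = w_n - w_n // G_s
    return q_idx // G_s - k_idx // G_s + shift

# At the maximum extended length 6656 (L=2048, G_s=4, w_n=512), the
# farthest key still maps inside the pretrained range [0, 2047]:
print(relative_position(6655, 0, 4, 512))  # 2047
```

This is why exceeding the Equation 4 bound pushes some relative positions past what the model saw in pre-training, while staying under it only guarantees in-range positions, not good quality at every length.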