
Small correction for YaRN-Mistral model

Open bloc97 opened this issue 1 year ago • 2 comments

Hello! Author of YaRN here. First of all, thank you for this very comprehensive paper on the data engineering challenges of long-context LLMs. It will certainly be very useful to the research community in the quest to train better and more robust long-context models!

However, there has been a small confusion about how the YaRN Mistral 7B 128K model was trained (Fig. 1 of the paper). That model was trained on a 16k-context-length dataset without length upsampling (the dataset is a derivative of the one TogetherAI used to train their 32k model, but chunked to 16k instead). The Llama 2 7B 128K model is the one that was trained on PG19, chunked into a context of 64k (not 128k), which I think would be a more appropriate comparison; there are simply too many confounding variables with our Mistral YaRN models.
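For clarity, by "chunked without length upsampling" I mean something like the sketch below: the tokenized corpus is simply cut into fixed 16k windows, with no reweighting toward long documents (function and variable names are illustrative, not our actual pipeline):

```python
# Minimal sketch of chunking a tokenized corpus into fixed-length training
# sequences with no length upsampling (illustrative only, not our pipeline).
def chunk_corpus(token_stream, chunk_len=16384):
    """Yield contiguous chunks of exactly `chunk_len` tokens."""
    buffer = []
    for token in token_stream:
        buffer.append(token)
        if len(buffer) == chunk_len:
            yield buffer
            buffer = []
    # The trailing partial chunk is dropped (it could also be padded).
```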

Also, the reason we were able to get away with training at such a small context length (16k) is that YaRN exhibits the behaviour needed for context-length extrapolation even without finetuning (albeit not very well, and only at small extension scale ratios).
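As a rough illustration of the per-dimension RoPE frequency scaling involved ("NTK-by-parts"): low-frequency dimensions are interpolated by the scale factor while high-frequency dimensions are left untouched. The sketch below is illustrative only; the dimension count, base, original context, and ramp bounds are placeholder values, not the exact reference implementation, and YaRN's additional attention-temperature scaling is omitted:

```python
import math

def yarn_scaled_inv_freqs(dim=128, base=10000.0, orig_ctx=4096, scale=16.0,
                          alpha=1.0, beta=32.0):
    """Return per-dimension RoPE inverse frequencies after YaRN-style scaling."""
    inv_freqs = []
    for i in range(0, dim, 2):
        inv_freq = base ** (-i / dim)        # standard RoPE inverse frequency
        wavelength = 2 * math.pi / inv_freq  # tokens per full rotation
        rotations = orig_ctx / wavelength    # rotations inside the original context
        # Ramp: fully interpolate long-wavelength dims, leave fast dims untouched.
        if rotations < alpha:
            gamma = 0.0
        elif rotations > beta:
            gamma = 1.0
        else:
            gamma = (rotations - alpha) / (beta - alpha)
        inv_freqs.append((1 - gamma) * inv_freq / scale + gamma * inv_freq)
    return inv_freqs
```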

Unfortunately, the passkey evaluation we used was much easier than the Needle-in-a-Haystack test (which didn't exist back then). We originally did not notice any degradation of long-context capabilities when shortening the dataset from 128k to 64k and then to 16k (cheaper to train), but with the newer Needle-in-a-Haystack tests the degradation is apparent. We will certainly be trying out the new methods outlined in this paper for future finetunes!
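To illustrate why the older evaluation is so much easier: a passkey prompt buries a single needle in highly repetitive filler, rather than sweeping needle depth across diverse haystack text at many lengths. A toy version looks something like the sketch below (illustrative only, not our exact evaluation script):

```python
import random

def make_passkey_prompt(n_filler=4000, seed=0):
    """Build a toy passkey-retrieval prompt: one needle in repetitive filler."""
    rng = random.Random(seed)
    passkey = rng.randint(10000, 99999)
    filler = "The grass is green. The sky is blue. The sun is yellow. " * n_filler
    insert_at = rng.randint(0, len(filler))
    needle = f" The pass key is {passkey}. Remember it. "
    prompt = (filler[:insert_at] + needle + filler[insert_at:]
              + "\nWhat is the pass key?")
    return prompt, passkey
```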

bloc97 · Feb 19, 2024