Konrad Staniszewski
Hi, thanks for the request. In a recent commit, I added initial support for gradient checkpointing (it simply skips the memory layers). As I am writing, it is not yet...
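For illustration, here is a minimal PyTorch sketch of what checkpointing only the non-memory layers could look like; the `layers` list and the `is_memory` flag are hypothetical stand-ins for the actual module structure, not the repository code.

```python
import torch
from torch.utils.checkpoint import checkpoint

# Schematic sketch only: `layers`, `mem_caches`, and the `is_memory` flag are
# hypothetical stand-ins for the actual LongLLaMA module structure.
def forward_with_selective_checkpointing(layers, hidden_states, mem_caches):
    for layer, mem_cache in zip(layers, mem_caches):
        if getattr(layer, "is_memory", False):
            # Memory layers are skipped by gradient checkpointing: they are
            # run normally, keeping their interaction with the memory cache.
            hidden_states = layer(hidden_states, mem_cache=mem_cache)
        else:
            # Regular layers are recomputed in the backward pass,
            # trading extra compute for lower activation memory.
            hidden_states = checkpoint(layer, hidden_states, use_reentrant=False)
    return hidden_states
```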
I apologize for the late response. We have recently published the [code](https://github.com/CStanKonrad/long_llama/tree/main/fine_tuning) that allows for fine-tuning the model on a single A100 80GB GPU. We use a total context size...
Thank you for your question! The reason behind this is that most examples from OpenOrca and MathInstruct should fit within this context length (only chat examples were longer, but as...
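For instance, a quick way to check whether an example fits within the context is to count its tokens; this is only a sketch, and the tokenizer checkpoint and the 2048-token limit are assumptions here, not the fine-tuning pipeline.

```python
from transformers import LlamaTokenizer

# Sketch only: the tokenizer checkpoint and the default limit are assumptions.
tokenizer = LlamaTokenizer.from_pretrained("syzymon/long_llama_3b")

def fits_in_context(example_text: str, max_length: int = 2048) -> bool:
    # Tokenize without truncation and check whether the full example fits.
    return len(tokenizer(example_text)["input_ids"]) <= max_length
```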
Thank you for your question! Yes, [this Colab](https://colab.research.google.com/github/CStanKonrad/long_llama/blob/main/long_llama_instruct_colab.ipynb) contains a demo where the model loads our paper and is asked questions about it. The paper is far longer than 2K...
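In spirit, the demo concatenates the paper with a question and generates an answer. A rough sketch along the lines of the repository README follows; the instruct checkpoint name, the prompt format, and the generation settings are illustrative, not the Colab's exact values.

```python
import torch
from transformers import LlamaTokenizer, AutoModelForCausalLM

# Sketch: checkpoint name and settings are illustrative, not the Colab's exact values.
tokenizer = LlamaTokenizer.from_pretrained("syzymon/long_llama_3b_instruct")
model = AutoModelForCausalLM.from_pretrained(
    "syzymon/long_llama_3b_instruct", torch_dtype=torch.float32, trust_remote_code=True
)

paper_text = open("fot_paper.txt").read()  # placeholder path to the long document
prompt = paper_text + "\nQuestion: What method does the paper propose?\nAnswer:"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids  # may exceed 2K tokens

output = model.generate(
    input_ids=input_ids,
    max_new_tokens=256,
    # last_context_length (from the repository README) controls how much of the
    # input stays in the local context; the rest goes to the memory layers' cache.
    last_context_length=1792,
)
print(tokenizer.decode(output[0][input_ids.shape[1]:]))
```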
Hi! I think that you can now send me messages on Twitter (https://twitter.com/CStanKonrad); however, I may be unavailable during the weekend. I have recently checked, and the code below works...
As mentioned in the [readme](https://github.com/CStanKonrad/long_llama/tree/main/fine_tuning#misc), the instruction fine-tuning does not use FoT. In fact, it can be thought of as a "modified" FoT with `cross_batch=1` because:

* We take the...
I apologize for the late response and the delay in publishing the continued pre-training code. The FoT continued pre-training code is now available [here](https://github.com/CStanKonrad/long_llama/tree/main/fot_continued_pretraining). A brief explanation of this...
Hi, thanks for the question. In short, we have not tried combining scaled positional encodings with FoT attention, so we cannot comment on the performance. Originally, FoT was designed to allow...
Regarding the question: the suggested kNN implementation retrieves, for each query in the memory layer, the k best-matching keys from the memory cache. In the 3B model, there are...
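A minimal sketch of that lookup is below; it ignores batching, multiple attention heads, and any approximate nearest-neighbour index, so it is only a schematic illustration.

```python
import torch

def knn_memory_lookup(queries, mem_keys, mem_values, k):
    """Schematic kNN lookup: for every query, fetch the k best-matching
    key/value pairs from the memory cache (shapes are simplified)."""
    # queries:    (num_queries, head_dim)
    # mem_keys:   (mem_size, head_dim)
    # mem_values: (mem_size, head_dim)
    scores = queries @ mem_keys.T                 # (num_queries, mem_size)
    top_scores, top_idx = scores.topk(k, dim=-1)  # k best matches per query
    top_keys = mem_keys[top_idx]                  # (num_queries, k, head_dim)
    top_values = mem_values[top_idx]              # (num_queries, k, head_dim)
    return top_keys, top_values, top_scores
```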
Our method should be faster than the Hugging Face LLaMA implementation, as it uses the extended context in only 3 layers (out of 26 in the case of the 3B model). For example,...
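As a back-of-the-envelope illustration of why this helps: the 2048-token local context matches the 3B model, while the 8K total length and the per-token cost model below are simplifications that ignore everything except attention.

```python
# Rough per-token attention cost during generation (illustrative numbers only).
n_layers = 26          # transformer layers in the 3B model
n_memory_layers = 3    # layers with access to the extended (memory) context
local_ctx = 2048       # context length seen by the remaining layers
total_ctx = 8192       # example total input length

# Number of key/value pairs each newly generated token attends to.
vanilla = n_layers * total_ctx
longllama = (n_layers - n_memory_layers) * local_ctx + n_memory_layers * total_ctx

print(f"attended KV pairs per token: vanilla={vanilla}, longllama={longllama}")
print(f"ratio: {vanilla / longllama:.1f}x")  # roughly 3x fewer in this setting
```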