Konrad Staniszewski
Hi, thanks for the request. In a recent commit, I added initial support for gradient checkpointing (it simply skips the memory layers). As I am writing, it is not yet...
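For illustration, here is a minimal PyTorch sketch of what checkpointing only the non-memory layers could look like; the `layers` list and the `is_memory` flag are hypothetical stand-ins for the actual module structure, not the repository code.

```python
import torch
from torch.utils.checkpoint import checkpoint

# Schematic sketch only: `layers`, `mem_caches`, and the `is_memory` flag are
# hypothetical stand-ins for the actual LongLLaMA module structure.
def forward_with_selective_checkpointing(layers, hidden_states, mem_caches):
    for layer, mem_cache in zip(layers, mem_caches):
        if getattr(layer, "is_memory", False):
            # Memory layers are skipped by gradient checkpointing: they are
            # run normally, keeping their interaction with the memory cache.
            hidden_states = layer(hidden_states, mem_cache=mem_cache)
        else:
            # Regular layers are recomputed in the backward pass,
            # trading extra compute for lower activation memory.
            hidden_states = checkpoint(layer, hidden_states, use_reentrant=False)
    return hidden_states
```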
I apologize for the late response. We have recently published the [code](https://github.com/CStanKonrad/long_llama/tree/main/fine_tuning) that allows for fine-tuning the model on a single A100 80GB GPU. We use a total context size...
Thank you for your question! The reason behind this is that most examples from OpenOrca and MathInstruct should fit within this context length (only chat examples were longer, but as...
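For instance, a quick way to check whether an example fits within the context is to count its tokens; this is only a sketch, and the tokenizer checkpoint and the 2048-token limit are assumptions here, not the fine-tuning pipeline.

```python
from transformers import LlamaTokenizer

# Sketch only: the tokenizer checkpoint and the default limit are assumptions.
tokenizer = LlamaTokenizer.from_pretrained("syzymon/long_llama_3b")

def fits_in_context(example_text: str, max_length: int = 2048) -> bool:
    # Tokenize without truncation and check whether the full example fits.
    return len(tokenizer(example_text)["input_ids"]) <= max_length
```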
Thank you for your question! Yes, [this Colab](https://colab.research.google.com/github/CStanKonrad/long_llama/blob/main/long_llama_instruct_colab.ipynb) contains a demo where the model loads our paper and is asked questions about it. The paper is far longer than 2K...
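In spirit, the demo concatenates the paper with a question and generates an answer. A rough sketch along the lines of the repository README follows; the instruct checkpoint name, the prompt format, and the generation settings are illustrative, not the Colab's exact values.

```python
import torch
from transformers import LlamaTokenizer, AutoModelForCausalLM

# Sketch: checkpoint name and settings are illustrative, not the Colab's exact values.
tokenizer = LlamaTokenizer.from_pretrained("syzymon/long_llama_3b_instruct")
model = AutoModelForCausalLM.from_pretrained(
    "syzymon/long_llama_3b_instruct", torch_dtype=torch.float32, trust_remote_code=True
)

paper_text = open("fot_paper.txt").read()  # placeholder path to the long document
prompt = paper_text + "\nQuestion: What method does the paper propose?\nAnswer:"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids  # may exceed 2K tokens

output = model.generate(
    input_ids=input_ids,
    max_new_tokens=256,
    # last_context_length (from the repository README) controls how much of the
    # input stays in the local context; the rest goes to the memory layers' cache.
    last_context_length=1792,
)
print(tokenizer.decode(output[0][input_ids.shape[1]:]))
```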
Hi! I think that you can now send me messages on Twitter (https://twitter.com/CStanKonrad); however, I may be unavailable during the weekend. I have recently checked, and the code below works...
As mentioned in the [readme](https://github.com/CStanKonrad/long_llama/tree/main/fine_tuning#misc), the instruction fine-tuning does not use FoT. In fact, it can be thought of as a "modified" FoT with `cross_batch=1` because:

* We take the...
I apologize for the late response and the delay in publishing the continued pre-training code. The FoT continued pre-training code is now available [here](https://github.com/CStanKonrad/long_llama/tree/main/fot_continued_pretraining). A brief explanation of this...
Hi, thanks for the question. In short, we have not tried combining scaled positional encodings with FoT attention, so we cannot comment on the performance. Originally, FoT was designed to allow...
Regarding the question: the suggested kNN implementation retrieves, for each query in the memory layer, the k best-matching keys from the memory cache. In the 3B model, there are...
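A minimal sketch of that lookup is below; it ignores batching, multiple attention heads, and any approximate nearest-neighbour index, so it is only a schematic illustration.

```python
import torch

def knn_memory_lookup(queries, mem_keys, mem_values, k):
    """Schematic kNN lookup: for every query, fetch the k best-matching
    key/value pairs from the memory cache (shapes are simplified)."""
    # queries:    (num_queries, head_dim)
    # mem_keys:   (mem_size, head_dim)
    # mem_values: (mem_size, head_dim)
    scores = queries @ mem_keys.T                 # (num_queries, mem_size)
    top_scores, top_idx = scores.topk(k, dim=-1)  # k best matches per query
    top_keys = mem_keys[top_idx]                  # (num_queries, k, head_dim)
    top_values = mem_values[top_idx]              # (num_queries, k, head_dim)
    return top_keys, top_values, top_scores
```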
Our method should be faster than the Hugging Face LLaMA implementation, as it uses the extended context in only 3 layers (out of 26 in the case of the 3B model). For example,...
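As a back-of-the-envelope illustration of why this helps: the 2048-token local context matches the 3B model, while the 8K total length and the per-token cost model below are simplifications that ignore everything except attention.

```python
# Rough per-token attention cost during generation (illustrative numbers only).
n_layers = 26          # transformer layers in the 3B model
n_memory_layers = 3    # layers with access to the extended (memory) context
local_ctx = 2048       # context length seen by the remaining layers
total_ctx = 8192       # example total input length

# Number of key/value pairs each newly generated token attends to.
vanilla = n_layers * total_ctx
longllama = (n_layers - n_memory_layers) * local_ctx + n_memory_layers * total_ctx

print(f"attended KV pairs per token: vanilla={vanilla}, longllama={longllama}")
print(f"ratio: {vanilla / longllama:.1f}x")  # roughly 3x fewer in this setting
```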