Xiaosong He
I have the same question. I guess they may use the same data-processing pipeline as in Memorizing Transformers (Figure 3)?
@MarkYangjiayi As described in Appendix A.2 of the FoT paper, FoT may not need the same data-processing pipeline as Memorizing Transformers. C_curr and C_prev are not represented by batch; instead they...
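To make the C_curr / C_prev distinction concrete, here is a minimal sketch of pairing each context window with its predecessor from the same document rather than identifying them with training batches. This is only my reading of Appendix A.2, not the authors' code; `split_contexts` and the chunking are assumptions.

```python
import torch

def split_contexts(token_ids: torch.Tensor, ctx_len: int):
    """Pair each context window (C_curr) with its predecessor (C_prev)
    from the same document, instead of treating batches as contexts."""
    windows = token_ids.split(ctx_len)          # consecutive windows of one document
    pairs = []
    for i, c_curr in enumerate(windows):
        c_prev = windows[i - 1] if i > 0 else None
        pairs.append((c_prev, c_curr))          # (C_prev, C_curr) used to build memory
    return pairs

# e.g. an 8192-token document with ctx_len=2048 yields 4 (C_prev, C_curr) pairs
doc = torch.randint(0, 32000, (8192,))
pairs = split_contexts(doc, 2048)
```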
> @hxs91 My hypothesis is that FoT uses a training strategy similar to the Recurrent Memory Transformer: if you want to train a local context of 2k with 4 segments,...
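For reference, this is roughly what that hypothesis would look like in code: an 8k document is cut into four 2k segments, processed in order, with keys/values from earlier segments feeding the memory seen by later ones. The `fot_like_step` name and the `model(segment, memory=...)` interface are placeholders of mine, not the repo's API.

```python
import torch

def fot_like_step(model, doc_tokens: torch.Tensor, seg_len: int = 2048, n_seg: int = 4):
    """Process one long document as consecutive segments; earlier segments'
    (key, value) pairs populate the memory visible to later segments."""
    memory_kv: list = []
    losses = []
    for i in range(n_seg):
        segment = doc_tokens[i * seg_len:(i + 1) * seg_len]
        # stand-in call: a memory-augmented model returning loss and new kv pairs
        loss, new_kv = model(segment, memory=memory_kv)
        memory_kv.extend(new_kv)
        losses.append(loss)
    return torch.stack(losses).mean()
```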
> We haven't compared inference time to longchat since we haven't tried 7b/13b longllama models - they are yet to come. The advantage of using our approach is that long...
> Roughly speaking, at inference time LongLLaMA uses only around 10% of the layers for long context. This means we save about 80-90% of the FLOPs spent on attention. When context...
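A back-of-the-envelope check of that figure, using assumed model sizes (32 layers, d_model 4096, 2k local context, 128K total context) rather than anything from the repo, lands in the stated 80-90% range:

```python
# Rough attention-FLOPs comparison: every layer attends over the full context
# vs. only ~10% of layers (the memory layers) doing so.
def attention_flops(n_layers, ctx_len, local_len, d_model, mem_frac=0.1):
    full = n_layers * ctx_len ** 2 * d_model                    # all layers see full context
    mixed = n_layers * ((1 - mem_frac) * local_len ** 2         # most layers: local only
                        + mem_frac * ctx_len ** 2) * d_model    # memory layers: full context
    return full, mixed

full, mixed = attention_flops(n_layers=32, ctx_len=128_000, local_len=2_048, d_model=4096)
print(f"saving ~ {1 - mixed / full:.0%}")   # ~ 90% for these assumed numbers
```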
> I'm not sure I understand the question. The experiment I will do is the following: we take input consisting of 128K tokens and feed it into both LLaMA 7B...
> What longllama checkpoint did you use? (there are base v1, base v1.1 and instruct) I agree longllama is a research preview and is not as competitive as closed source...