Results: 11 comments of Mooler0410

> As the paper mentioned, self-Extend do not support flash-attn.

We recently added flash-attention support for SelfExtend.

Author here, glad to answer any questions about the details of our work.

Llama.cpp supports SelfExtend and has a good implementation; it works with GGUF models. SelfExtend has received quite positive feedback from the llama.cpp community. You can check their repo for more details.

Hi! We have some empirical results on this. You can check out this link: [https://github.com/datamllab/LongLM?tab=readme-ov-file#3how-to-choose-the-group_size-and-neighbor_window](https://github.com/datamllab/LongLM?tab=readme-ov-file#3how-to-choose-the-group_size-and-neighbor_window). Hope this helps!
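As a rough rule, the extended window is about (pretraining length − neighbor window) × group size + neighbor window, so you can back out the smallest workable group size for a target input length. Below is a minimal sketch of that arithmetic; the helper name and example numbers are illustrative, not code from the repo, and in practice you should leave some safety margin.

```python
import math

def min_group_size(target_len: int, pretrain_len: int, neighbor_window: int) -> int:
    """Smallest group_size G such that
    (pretrain_len - neighbor_window) * G + neighbor_window >= target_len
    (assumed relation; leave some safety margin in practice)."""
    usable = pretrain_len - neighbor_window
    assert usable > 0, "neighbor_window must be smaller than the pretraining window"
    return max(1, math.ceil((target_len - neighbor_window) / usable))

# Example: a 4k-pretrained model, 16k target inputs, neighbor window of 1024
print(min_group_size(16 * 1024, 4 * 1024, 1024))  # -> 5
```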

If you are asking why we use this setting for 4k: actually, we just selected the two parameters somewhat arbitrarily, as long as they worked well, and we never considered whether...

We believe how well SelfExtend works depends highly on how good the extended model is within its original pretraining context window. This means that if Qwen1.5's 32k context window is...

We are not very familiar with vLLM and its internal mechanism. We will check its compatibility with SelfExtend. Thanks for your suggestion!

> I followed your direction like the below to apply selfextend to llama3
>
> """
>
> [04/19/2024]:💡 We added the support for LLama-3 with transformers==4.40. To use it...
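Roughly, applying SelfExtend to a loaded Hugging Face model follows the pattern below. This is a minimal sketch only: check the repo README for the exact `SelfExtend.apply` signature, and treat the group size and neighbor window values here as placeholders.

```python
from transformers import AutoModelForCausalLM
import SelfExtend  # module from the LongLM repo

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct", torch_dtype="auto"
)
# Patch the model's attention in place: (model, group_size, neighbor window).
# Argument order assumed from the README; verify against the repo.
SelfExtend.apply(model, 8, 1024)
```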

Hi! We just implemented FlashAttention for SelfExtend, utilizing the windowed FA supported by flash_attn. In short, we merge two FA passes to get SelfExtend's attention. Check https://github.com/datamllab/LongLM/pull/28...
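For intuition, the merge combines two partial softmax-attention results using their per-query log-sum-exp values. Here is a minimal plain-PyTorch sketch of that merge step (not the PR's actual flash_attn code); it assumes the two passes cover disjoint key sets and use the same softmax scale.

```python
import torch

def merge_attention_parts(out_a, lse_a, out_b, lse_b):
    """Combine two partial attention outputs computed over disjoint key sets
    (e.g. a neighbor-window pass and a grouped pass) into the full result.

    out_*: [batch, heads, q_len, head_dim], softmax-normalized within each part
    lse_*: [batch, heads, q_len], log-sum-exp of the attention scores per part
    """
    lse = torch.logaddexp(lse_a, lse_b)            # normalizer of the union
    w_a = torch.exp(lse_a - lse).unsqueeze(-1)     # relative weight of part a
    w_b = torch.exp(lse_b - lse).unsqueeze(-1)     # relative weight of part b
    return w_a * out_a + w_b * out_b, lse
```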

> v0.1 only supports 8K token length, which leads to low performance. We use v0.2 because it supports 32K tokens. The first 3 subsets of BANKING77 are below 8k. So,...