ChunkLlama

Data and code for our paper "Training-Free Long-Context Scaling of Large Language Models"

14 ChunkLlama issues

Hi, thank you for this excellent work! It seems that this code does not take the attention_mask into account during ChunkLlama inference. Does this code support batch inference?
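A minimal, framework-free sketch of why the attention_mask matters for batch inference: when sequences of different lengths are padded into one batch, the padded positions must be masked out so they contribute nothing to attention. The `pad_batch` helper and `PAD_ID` value below are illustrative, not part of ChunkLlama; in practice the Hugging Face tokenizer produces this mask for you.

```python
# Hypothetical helper: left-pad a batch of token sequences and build the
# matching attention_mask (1 = real token, 0 = padding). Pure Python so the
# idea is framework-independent.
PAD_ID = 0  # illustrative pad token id

def pad_batch(sequences, pad_id=PAD_ID):
    max_len = max(len(s) for s in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        # Left padding is the usual choice for decoder-only generation.
        input_ids.append([pad_id] * n_pad + list(seq))
        attention_mask.append([0] * n_pad + [1] * len(seq))
    return input_ids, attention_mask

ids, mask = pad_batch([[5, 6, 7], [9]])
# ids  == [[5, 6, 7], [0, 0, 9]]
# mask == [[1, 1, 1], [0, 0, 1]]
```

If an attention path ignores this mask, padded positions leak into the attention scores, which is the symptom this issue is asking about.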

Exciting work, I am very interested. Since my coding ability is weak, could you provide a CUDA implementation of DCA? It would be greatly appreciated.

How can I use this approach in a vLLM deployment without training? Can you give me a specific example? Thanks.

`Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████| 2/2 [00:04...`

fix "repalce" to "replace"

I noticed that you compared many models in your paper. Could you please share the implementation code for the training-free models mentioned in Table 1, or the GitHub repository you...

This PR fixes a single-character typo in the library name: `MyMuPDF` (which does not exist) is corrected to `PyMuPDF`.

When I use Llama3 (with flash decoding) to run run_chunkllama_100k, it starts successfully. But when I input a prompt, I encounter a TypeError: ``` File "ChunkLlama/flash_decoding_chunkllama.py", line...

Hello, thank you for providing DCA to scale model context up to 100K+. However, I encountered an issue when trying to run inference with a 128k context on 4 GPUs. The...

flash_decoding_chunkllama.py is missing an import: Ln 510 calls `attention_mask = _prepare_4d_causal_attention_mask_for_sdpa(`, but Ln 7 only has `from transformers.modeling_attn_mask_utils import _prepare_4d_causal_attention_mask`, so `_prepare_4d_causal_attention_mask_for_sdpa` is never imported.