ChunkLlama
Data and code for our paper "Training-Free Long-Context Scaling of Large Language Models"
Hi guys, thank you for this excellent work! It seems that this code does not consider the attention_mask during chunkllama inference. Does this code support batch inference?
Exciting work, I am very interested! Since my coding ability is weak, could you provide CUDA code for DCA? It would be greatly appreciated.
How can I use this approach in a vLLM deployment without training? Can you give me a specific example? Thanks.
`Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████| 2/2 [00:04
Fix typo: "repalce" to "replace"
I noticed that you compared many models in your paper. Could you please share the implementation code for the training-free models mentioned in Table 1, or the GitHub repository you...
This PR fixes a single-character typo in the library name: before, `MyMuPDF` (does not exist); after, the correct name `PyMuPDF`.
When I use Llama3 (with flash decoding) to run run_chunkllama_100k, it starts successfully. But when I input a prompt, I encounter a TypeError: ``` File "ChunkLlama/flash_decoding_chunkllama.py", line...
Hello, Thank you for providing DCA to scale model context up to 100K+. However, I encountered an issue when trying to inference with a 128k context using 4 GPUs. The...
Missing import of **_prepare_4d_causal_attention_mask_for_sdpa** from transformers.modeling_attn_mask_utils in flash_decoding_chunkllama.py: Ln 510 calls `attention_mask = _prepare_4d_causal_attention_mask_for_sdpa(`, but Ln 7 only imports `_prepare_4d_causal_attention_mask`.
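A minimal sketch of the fix described in the issue above, assuming the rest of the import line in `flash_decoding_chunkllama.py` is unchanged (the `_for_sdpa` helper exists in recent `transformers` releases; verify against the installed version):

```diff
-from transformers.modeling_attn_mask_utils import _prepare_4d_causal_attention_mask
+from transformers.modeling_attn_mask_utils import (
+    _prepare_4d_causal_attention_mask,
+    _prepare_4d_causal_attention_mask_for_sdpa,
+)
```

With both helpers imported, the SDPA branch at Ln 510 can resolve `_prepare_4d_causal_attention_mask_for_sdpa` without a NameError.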