Position Embedding with Seq > 512
I am trying to run Llama-3.1-8B with a sequence length > 512 and I get the error below. Do I have to set the position embeddings manually to get this to work?
from airllm import AutoModel

model = AutoModel.from_pretrained("meta-llama/Meta-Llama-3.1-8B", delete_original=True)

# Long dummy prompts, tokenized to 1024 tokens each.
prompts = ["a " * 10000 for i in range(100)]
model.tokenizer.pad_token = model.tokenizer.eos_token
input_tokens = model.tokenizer(
    prompts,
    return_tensors="pt",
    truncation=True,
    padding=True,
    max_length=1024,
)

generation_output = model.forward(
    input_ids=input_tokens['input_ids'].cuda(),
    attention_mask=input_tokens['attention_mask'].cuda(),
    use_cache=False,
)
Fetching 13 files: 100% 13/13 [00:00<00:00, 986.88it/s]
found_layers:{'model.embed_tokens.': True, 'model.layers.0.': True, 'model.layers.1.': True, 'model.layers.2.': True, 'model.layers.3.': True, 'model.layers.4.': True, 'model.layers.5.': True, 'model.layers.6.': True, 'model.layers.7.': True, 'model.layers.8.': True, 'model.layers.9.': True, 'model.layers.10.': True, 'model.layers.11.': True, 'model.layers.12.': True, 'model.layers.13.': True, 'model.layers.14.': True, 'model.layers.15.': True, 'model.layers.16.': True, 'model.layers.17.': True, 'model.layers.18.': True, 'model.layers.19.': True, 'model.layers.20.': True, 'model.layers.21.': True, 'model.layers.22.': True, 'model.layers.23.': True, 'model.layers.24.': True, 'model.layers.25.': True, 'model.layers.26.': True, 'model.layers.27.': True, 'model.layers.28.': True, 'model.layers.29.': True, 'model.layers.30.': True, 'model.layers.31.': True, 'model.norm.': True, 'lm_head.': True}
saved layers already found in /root/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3.1-8B/snapshots/48d6d0fc4e02fb1269b36940650a1b7233035cbb/splitted_model
new version of transfomer, no need to use BetterTransformer, try setting attn impl to sdpa...
attn imp: <class 'transformers.models.llama.modeling_llama.LlamaSdpaAttention'>
new version of transfomer, no need to use BetterTransformer, try setting attn impl to sdpa...
attn imp: <class 'transformers.models.llama.modeling_llama.LlamaSdpaAttention'>
running layers(cuda:0): 3%|▎ | 1/35 [00:01<00:35, 1.04s/it]
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-14-c76626e36e7d> in <cell line: 15>()
     13 )
     14
---> 15 generation_output = model.forward(
     16     input_ids=input_tokens['input_ids'].cuda(),
     17     attention_mask=input_tokens['attention_mask'].cuda(),

7 frames

/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py in apply_rotary_pos_emb(q, k, cos, sin, position_ids, unsqueeze_dim)
    217     cos = cos.unsqueeze(unsqueeze_dim)
    218     sin = sin.unsqueeze(unsqueeze_dim)
--> 219     q_embed = (q * cos) + (rotate_half(q) * sin)
    220     k_embed = (k * cos) + (rotate_half(k) * sin)
    221     return q_embed, k_embed
RuntimeError: The size of tensor a (1024) must match the size of tensor b (512) at non-singleton dimension 2
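The mismatch seems to come from the rotary cos/sin tables: they appear to be precomputed for 512 positions (the default maximum sequence length), while the queries here cover 1024 positions. A minimal sketch in plain PyTorch (not airllm or transformers code; the shapes are illustrative) that reproduces the same broadcast failure:

import torch

q = torch.randn(1, 32, 1024, 128)  # (batch, heads, seq_len=1024, head_dim)
cos = torch.randn(1, 512, 128)     # rotary table covering only 512 positions
cos = cos.unsqueeze(1)             # unsqueeze_dim=1, as in apply_rotary_pos_emb

# Raises: The size of tensor a (1024) must match the size of tensor b (512)
# at non-singleton dimension 2
q_embed = q * cos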
Fixed with:

model.max_seq_len = 1024
model.init_model()

Maybe this should be an init option.
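For reference, here is the repro above with the workaround applied before calling forward (a sketch based on the fix in this thread; it assumes max_seq_len and init_model() behave as shown there):

from airllm import AutoModel

model = AutoModel.from_pretrained("meta-llama/Meta-Llama-3.1-8B", delete_original=True)

# Workaround from above: raise the position-embedding length to the
# tokenizer's max_length, then rebuild the model.
model.max_seq_len = 1024
model.init_model()

model.tokenizer.pad_token = model.tokenizer.eos_token
input_tokens = model.tokenizer(
    ["a " * 10000],
    return_tensors="pt",
    truncation=True,
    padding=True,
    max_length=1024,
)
generation_output = model.forward(
    input_ids=input_tokens['input_ids'].cuda(),
    attention_mask=input_tokens['attention_mask'].cuda(),
    use_cache=False,
)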