[Core] generate from input embeds
Adds support for passing prompt_embeds to LLM.generate, as

llm.generate({"prompt_embeds": input_embeds}, sampling_params)

or

llm.generate(
    [{"prompt_embeds": input_embeds} for input_embeds in inputs_embeds], sampling_params
)

This enables use cases where only the embedding layer is finetuned, letting the same model backend serve multiple custom-tuned embedding layers.
FIX #416 FIX #8323
Inspired by https://github.com/vllm-project/vllm/pull/1265, which is by now very outdated.
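For reference, a minimal end-to-end sketch of the intended usage (it mirrors the test script later in this thread; the model path is illustrative, and the {"prompt_embeds": ...} dict is the input form added by this PR):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from vllm import LLM, SamplingParams

# Illustrative model path; any decoder-only model supported by vLLM should work.
model_path = "mistralai/Mistral-7B-Instruct-v0.2"

# Build prompt embeddings offline with the (possibly finetuned) embedding layer.
hf_model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_path)
embed = hf_model.get_input_embeddings()

input_ids = tokenizer.encode("Hello, my name is", return_tensors="pt")
with torch.no_grad():
    prompt_embeds = embed(input_ids).squeeze(0)  # shape: (seq_len, hidden_size)

# Pass the embeddings to vLLM instead of a text prompt.
llm = LLM(model=model_path)
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=10)
outputs = llm.generate({"prompt_embeds": prompt_embeds}, sampling_params)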
👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which consists of a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build on the Buildkite UI.
Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).
To run full CI, you can do one of these:
- Comment /ready on the PR
- Add ready label to the PR
- Enable auto-merge.
🚀
@WoosukKwon @ywang96 @robertgshaw2-neuralmagic
the failed tests with
ValueError: Cannot use apply_chat_template() because tokenizer.chat_template is not set and no template argument was passed! For information about writing templates and setting the tokenizer.chat_template attribute, please see the documentation at https://huggingface.co/docs/transformers/main/en/chat_templating
seem unrelated to my changes, and I can't reproduce them locally.
other than that, this is ready for review
This is due to a recent change in transformers that deprecated the default chat template; it should have been fixed by https://github.com/vllm-project/vllm/pull/7238. Can you merge your branch with main again?
ready for review @ywang96 @WoosukKwon @robertgshaw2-neuralmagic
script I ran for testing:
# %%
ASYNC = True
USE_RAY = False
# %%
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
device = "cpu"
model_path = "/models/huggingface/mistralai/Mistral-7B-Instruct-v0.2" # "/models/huggingface/meta-llama/llama-2-7b-chat-hf"
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map=device, torch_dtype=torch.bfloat16
).eval()
tokenizer = AutoTokenizer.from_pretrained(model_path)
embed = model.get_input_embeddings()
if not tokenizer.pad_token:
    if tokenizer.eos_token:
        tokenizer.add_special_tokens(
            {
                "pad_token": tokenizer.eos_token,
            }
        )
    else:
        tokenizer.add_special_tokens({"pad_token": "<PAD>"})
        model.resize_token_embeddings(model.config.vocab_size + 1)
# %%
from vllm import LLM, SamplingParams, AsyncLLMEngine, AsyncEngineArgs
from pprint import pprint
if ASYNC:
    llm = AsyncLLMEngine.from_engine_args(
        AsyncEngineArgs(
            model="/models/huggingface/mistralai/Mistral-7B-Instruct-v0.2",
            tensor_parallel_size=4,
            distributed_executor_backend='ray' if USE_RAY else 'mp',
        )
    )
else:
    llm = LLM(
        model="/models/huggingface/mistralai/Mistral-7B-Instruct-v0.2",
        distributed_executor_backend='ray' if USE_RAY else 'mp',
        tensor_parallel_size=4,
    )
# %%
prompts = [
"Hello, my name is",
"The president of the United States is",
"My favorite book of all time is",
"When I wake up in the morning, the first thing I do is",
"The best vacation I ever took was to",
"If I could have any superpower, it would be",
"One thing on my bucket list is",
"My go-to comfort food is",
"The most inspiring person in history, in my opinion, is",
"When I was a child, I wanted to grow up to be",
"If I could live in any period of history, I would choose",
"A skill I've always wanted to learn but haven't yet is",
"The most memorable movie quote for me is",
"In my free time, I like to",
"If I could meet any fictional character, I would choose",
"The last dream I remember having was about",
"My favorite season of the year is",
"One thing I wish more people knew about me is",
"The most delicious meal I've ever had was",
"If I could travel anywhere in the world right now, I would go to",
"A hobby I have that most people don't know about is",
"The best piece of advice I've ever received is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=10)
# %%
inputs_embeds = []
for prompt in prompts:
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        input_embeds = embed(input_ids).squeeze(0)
    pprint(input_embeds.shape)
    inputs_embeds.append(input_embeds)
# %%
from uuid import uuid4
if ASYNC:
    results = []
    for prompt in prompts:
        async for request_output in llm.generate(prompt, sampling_params, str(uuid4())):
            pass
        results.append(request_output)
else:
    results = llm.generate(prompts, sampling_params)
pprint(results)
# %%
if ASYNC:
    async for request_output in llm.generate({"prompt_embeds": inputs_embeds[0]}, sampling_params, str(uuid4())):
        results = request_output
else:
    results = llm.generate({"prompt_embeds": inputs_embeds[0]}, sampling_params)
pprint(results)
# %%
if ASYNC:
    results = []
    for input_embeds in inputs_embeds:
        async for request_output in llm.generate({"prompt_embeds": input_embeds}, sampling_params, str(uuid4())):
            pass
        results.append(request_output)
else:
    results = llm.generate(
        [{"prompt_embeds": input_embeds} for input_embeds in inputs_embeds], sampling_params
    )
pprint(results)
# %%
from random import shuffle
mixed_inputs = [{"prompt_embeds": input_embeds} for input_embeds in inputs_embeds[:10]] + prompts[10:]
shuffle(mixed_inputs)
if ASYNC:
    results = []
    for mixed_input in mixed_inputs:
        async for request_output in llm.generate(mixed_input, sampling_params, str(uuid4())):
            pass
        results.append(request_output)
else:
    results = llm.generate(
        mixed_inputs,
        sampling_params,
    )
pprint(results)
Thanks for the great work. I wanted to try this PR as below; my model is a self-defined LlamaForCausalLM.
outputs = llm.generate({"prompt_embeds": input_token_embedding},
                       sampling_params)
However, I got the error
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
Cell In[7], line 1
----> 1 outputs = llm.generate({"prompt_embeds": input_token_embedding},
2 sampling_params)
File ~/anaconda3/envs/demo_tts/lib/python3.10/site-packages/vllm/utils.py:1032, in deprecate_kwargs.<locals>.wrapper.<locals>.inner(*args, **kwargs)
1025 msg += f" {additional_message}"
1027 warnings.warn(
1028 DeprecationWarning(msg),
1029 stacklevel=3, # The inner function takes up one level
1030 )
-> 1032 return fn(*args, **kwargs)
File ~/anaconda3/envs/demo_tts/lib/python3.10/site-packages/vllm/entrypoints/llm.py:347, in LLM.generate(self, prompts, sampling_params, prompt_token_ids, use_tqdm, lora_request, prompt_adapter_request, guided_options_request)
338 sampling_params = SamplingParams()
340 self._validate_and_add_requests(
341 inputs=inputs,
342 params=sampling_params,
343 lora_request=lora_request,
344 prompt_adapter_request=prompt_adapter_request,
345 guided_options=guided_options_request)
--> 347 outputs = self._run_engine(use_tqdm=use_tqdm)
348 return LLMEngine.validate_outputs(outputs, RequestOutput)
File ~/anaconda3/envs/demo_tts/lib/python3.10/site-packages/vllm/entrypoints/llm.py:704, in LLM._run_engine(self, use_tqdm)
702 total_out_toks = 0
703 while self.llm_engine.has_unfinished_requests():
--> 704 step_outputs = self.llm_engine.step()
705 for output in step_outputs:
706 if output.finished:
File ~/anaconda3/envs/demo_tts/lib/python3.10/site-packages/vllm/engine/llm_engine.py:1570, in LLMEngine.step(self)
1566 if allow_async_output_proc:
1567 execute_model_req.async_callback = self.async_callbacks[
1568 virtual_engine]
-> 1570 output = self.model_executor.execute_model(
1571 execute_model_req=execute_model_req)
1573 # We need to do this here so that last step's sampled_token_ids can
1574 # be passed to the next iteration for PP.
1575 if self.scheduler_config.is_multi_step:
File ~/anaconda3/envs/demo_tts/lib/python3.10/site-packages/vllm/executor/gpu_executor.py:130, in GPUExecutor.execute_model(self, execute_model_req)
127 def execute_model(
128 self, execute_model_req: ExecuteModelRequest
129 ) -> Optional[List[Union[SamplerOutput, PoolerOutput]]]:
--> 130 output = self.driver_worker.execute_model(execute_model_req)
131 return output
File ~/anaconda3/envs/demo_tts/lib/python3.10/site-packages/vllm/worker/worker_base.py:327, in LocalOrDistributedWorkerBase.execute_model(self, execute_model_req)
322 if (self.observability_config is not None
323 and self.observability_config.collect_model_execute_time):
324 orig_model_execute_time = intermediate_tensors.tensors.get(
325 "model_execute_time", torch.tensor(0)).item()
--> 327 output = self.model_runner.execute_model(
328 model_input=model_input,
329 kv_caches=self.kv_cache[worker_input.virtual_engine]
330 if self.kv_cache is not None else None,
331 intermediate_tensors=intermediate_tensors,
332 num_steps=num_steps,
333 **kwargs,
334 )
336 model_execute_time = time.perf_counter() - start_time
337 if not get_pp_group().is_last_rank:
338 # output is IntermediateTensors
File ~/anaconda3/envs/demo_tts/lib/python3.10/site-packages/torch/utils/_contextlib.py:116, in context_decorator.<locals>.decorate_context(*args, **kwargs)
113 @functools.wraps(func)
114 def decorate_context(*args, **kwargs):
115 with ctx_factory():
--> 116 return func(*args, **kwargs)
File ~/anaconda3/envs/demo_tts/lib/python3.10/site-packages/vllm/worker/model_runner.py:1505, in ModelRunner.execute_model(self, model_input, kv_caches, intermediate_tensors, num_steps)
1501 if self.model_supports_input_embeds:
1502 model_params.update(
1503 inputs_embeds=model_input.input_embeds,
1504 inputs_embeds_masks=model_input.input_embeds_masks)
-> 1505 hidden_or_intermediate_states = model_executable(**model_params)
1507 if (self.observability_config is not None
1508 and self.observability_config.collect_model_forward_time):
1509 model_forward_end.record()
File ~/anaconda3/envs/demo_tts/lib/python3.10/site-packages/torch/nn/modules/module.py:1553, in Module._wrapped_call_impl(self, *args, **kwargs)
1551 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1552 else:
-> 1553 return self._call_impl(*args, **kwargs)
File ~/anaconda3/envs/demo_tts/lib/python3.10/site-packages/torch/nn/modules/module.py:1562, in Module._call_impl(self, *args, **kwargs)
1557 # If we don't have any hooks, we want to skip the rest of the logic in
1558 # this function, and just call forward.
1559 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1560 or _global_backward_pre_hooks or _global_backward_hooks
1561 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1562 return forward_call(*args, **kwargs)
1564 try:
1565 result = None
File ~/anaconda3/envs/demo_tts/lib/python3.10/site-packages/vllm/model_executor/models/llama.py:433, in LlamaForCausalLM.forward(self, input_ids, positions, kv_caches, attn_metadata, intermediate_tensors, inputs_embeds, inputs_embeds_masks)
423 def forward(
424 self,
425 input_ids: torch.Tensor,
(...)
431 inputs_embeds_masks: Optional[torch.Tensor] = None,
432 ) -> Union[torch.Tensor, IntermediateTensors]:
--> 433 model_output = self.model(input_ids, positions, kv_caches,
434 attn_metadata, intermediate_tensors,
435 inputs_embeds, inputs_embeds_masks)
436 return model_output
File ~/anaconda3/envs/demo_tts/lib/python3.10/site-packages/torch/nn/modules/module.py:1553, in Module._wrapped_call_impl(self, *args, **kwargs)
1551 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1552 else:
-> 1553 return self._call_impl(*args, **kwargs)
File ~/anaconda3/envs/demo_tts/lib/python3.10/site-packages/torch/nn/modules/module.py:1562, in Module._call_impl(self, *args, **kwargs)
1557 # If we don't have any hooks, we want to skip the rest of the logic in
1558 # this function, and just call forward.
1559 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1560 or _global_backward_pre_hooks or _global_backward_hooks
1561 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1562 return forward_call(*args, **kwargs)
1564 try:
1565 result = None
File ~/anaconda3/envs/demo_tts/lib/python3.10/site-packages/vllm/model_executor/models/llama.py:331, in LlamaModel.forward(self, input_ids, positions, kv_caches, attn_metadata, intermediate_tensors, inputs_embeds, inputs_embeds_masks)
329 for i in range(self.start_layer, self.end_layer):
330 layer = self.layers[i]
--> 331 hidden_states, residual = layer(
332 positions,
333 hidden_states,
334 kv_caches[i - self.start_layer],
335 attn_metadata,
336 residual,
337 )
339 if not get_pp_group().is_last_rank:
340 return IntermediateTensors({
341 "hidden_states": hidden_states,
342 "residual": residual
343 })
File ~/anaconda3/envs/demo_tts/lib/python3.10/site-packages/torch/nn/modules/module.py:1553, in Module._wrapped_call_impl(self, *args, **kwargs)
1551 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1552 else:
-> 1553 return self._call_impl(*args, **kwargs)
File ~/anaconda3/envs/demo_tts/lib/python3.10/site-packages/torch/nn/modules/module.py:1562, in Module._call_impl(self, *args, **kwargs)
1557 # If we don't have any hooks, we want to skip the rest of the logic in
1558 # this function, and just call forward.
1559 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1560 or _global_backward_pre_hooks or _global_backward_hooks
1561 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1562 return forward_call(*args, **kwargs)
1564 try:
1565 result = None
File ~/anaconda3/envs/demo_tts/lib/python3.10/site-packages/vllm/model_executor/models/llama.py:252, in LlamaDecoderLayer.forward(self, positions, hidden_states, kv_cache, attn_metadata, residual)
249 else:
250 hidden_states, residual = self.input_layernorm(
251 hidden_states, residual)
--> 252 hidden_states = self.self_attn(
253 positions=positions,
254 hidden_states=hidden_states,
255 kv_cache=kv_cache,
256 attn_metadata=attn_metadata,
257 )
259 # Fully Connected
260 hidden_states, residual = self.post_attention_layernorm(
261 hidden_states, residual)
File ~/anaconda3/envs/demo_tts/lib/python3.10/site-packages/torch/nn/modules/module.py:1553, in Module._wrapped_call_impl(self, *args, **kwargs)
1551 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1552 else:
-> 1553 return self._call_impl(*args, **kwargs)
File ~/anaconda3/envs/demo_tts/lib/python3.10/site-packages/torch/nn/modules/module.py:1562, in Module._call_impl(self, *args, **kwargs)
1557 # If we don't have any hooks, we want to skip the rest of the logic in
1558 # this function, and just call forward.
1559 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1560 or _global_backward_pre_hooks or _global_backward_hooks
1561 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1562 return forward_call(*args, **kwargs)
1564 try:
1565 result = None
File ~/anaconda3/envs/demo_tts/lib/python3.10/site-packages/vllm/model_executor/models/llama.py:182, in LlamaAttention.forward(self, positions, hidden_states, kv_cache, attn_metadata)
180 q, k, v = qkv.split([self.q_size, self.kv_size, self.kv_size], dim=-1)
181 q, k = self.rotary_emb(positions, q, k)
--> 182 attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
183 output, _ = self.o_proj(attn_output)
184 return output
File ~/anaconda3/envs/demo_tts/lib/python3.10/site-packages/torch/nn/modules/module.py:1553, in Module._wrapped_call_impl(self, *args, **kwargs)
1551 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1552 else:
-> 1553 return self._call_impl(*args, **kwargs)
File ~/anaconda3/envs/demo_tts/lib/python3.10/site-packages/torch/nn/modules/module.py:1562, in Module._call_impl(self, *args, **kwargs)
1557 # If we don't have any hooks, we want to skip the rest of the logic in
1558 # this function, and just call forward.
1559 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1560 or _global_backward_pre_hooks or _global_backward_hooks
1561 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1562 return forward_call(*args, **kwargs)
1564 try:
1565 result = None
File ~/anaconda3/envs/demo_tts/lib/python3.10/site-packages/vllm/attention/layer.py:98, in Attention.forward(self, query, key, value, kv_cache, attn_metadata, attn_type)
88 def forward(
89 self,
90 query: torch.Tensor,
(...)
95 attn_type: AttentionType = AttentionType.DECODER,
96 ) -> torch.Tensor:
---> 98 return self.impl.forward(query,
99 key,
100 value,
101 kv_cache,
102 attn_metadata,
103 self._k_scale,
104 self._v_scale,
105 attn_type=attn_type)
File ~/anaconda3/envs/demo_tts/lib/python3.10/site-packages/vllm/attention/backends/xformers.py:574, in XFormersImpl.forward(self, query, key, value, kv_cache, attn_metadata, k_scale, v_scale, attn_type)
569 num_decode_tokens = 0
571 if attn_type == AttentionType.DECODER:
572 # Only enforce this shape-constraint for decoder
573 # self-attention
--> 574 assert key.shape[0] == num_prefill_tokens + num_decode_tokens
575 assert value.shape[0] == num_prefill_tokens + num_decode_tokens
577 output = torch.empty_like(query)
A few models have been added since this PR. Can you go through the models and check that all of them can support this input?
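For context on what "support this input" requires of a model, here is a standalone, assumption-labeled sketch of the idea: positions that come with precomputed embeddings bypass the embedding table. The inputs_embeds / inputs_embeds_masks names follow the llama.py signature visible in the traceback above, but the mask semantics and the exact wiring inside vLLM are assumptions here, not this PR's actual diff.

import torch
import torch.nn as nn

vocab_size, hidden_size, seq_len = 100, 16, 8
embed_tokens = nn.Embedding(vocab_size, hidden_size)

input_ids = torch.randint(0, vocab_size, (seq_len,))
inputs_embeds = torch.randn(seq_len, hidden_size)             # precomputed prompt embeddings
inputs_embeds_masks = torch.ones(seq_len, dtype=torch.bool)   # assumed: True where embeds are supplied

# Positions covered by the mask take the precomputed embeddings;
# the rest fall back to the model's own embedding table.
with torch.no_grad():
    hidden_states = embed_tokens(input_ids)
    hidden_states[inputs_embeds_masks] = inputs_embeds[inputs_embeds_masks]
print(hidden_states.shape)  # torch.Size([8, 16])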
Here is a bug:
vllm/inputs/preprocess.py line 333 sets prompt_token_ids=[], but vllm/engine/llm_engine.py line 1721 only checks: if prompt_ids is None
Traceback (most recent call last):
File "/mnt/bn/integrated-risk-model2/LLM_Inference_Service/llmserver_diy/llmserver/core/vanilla_vllm/vanilla_vllm_scheduler.py", line 42, in inner
async for result in func(*args, **kwargs):
File "/mnt/bn/integrated-risk-model2/LLM_Inference_Service/llmserver_diy/llmserver/core/vanilla_vllm/vanilla_vllm_scheduler.py", line 160, in generate
async for result in self.vanilla_vllm_engine.generate(*generate_input):
File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 882, in generate
async for output in await self.add_request(
File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 109, in generator
raise result
File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 666, in engine_step
await self.engine.add_request_async(**new_request)
File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 430, in add_request_async
self._add_processed_request(
File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 627, in _add_processed_request
self._validate_model_inputs(processed_inputs)
File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 1730, in _validate_model_inputs
raise ValueError("You can only provide either tokens or "
ValueError: You can only provide either tokens or embeddings, not both
This change will work: if prompt_ids is None or len(prompt_ids) == 0:
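A self-contained, hypothetical reconstruction of why the check misfires and what the suggested condition changes (the variable names are illustrative stand-ins for the ones in _validate_model_inputs):

# preprocess.py sets prompt_token_ids=[] for embedding-only inputs
prompt_ids = []
prompt_embeds_provided = True

# Before the fix: [] is not None, so the engine thinks tokens were provided and
# raises "You can only provide either tokens or embeddings, not both".
buggy_has_tokens = prompt_ids is not None

# Suggested fix: also treat an empty list as "no tokens provided".
fixed_has_tokens = prompt_ids is not None and len(prompt_ids) > 0

assert buggy_has_tokens and prompt_embeds_provided        # spurious conflict
assert not (fixed_has_tokens and prompt_embeds_provided)  # no conflict after the fix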
And another bug whose cause I haven't found yet, using the InternLM2 model. Input:
prompt_embeds is a torch.Tensor of torch.Size([1777, 4096])
SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=200, min_tokens=1, logprobs=20, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None)
Error:
ERROR 09-20 01:48:34 worker_base.py:464] Traceback (most recent call last):
ERROR 09-20 01:48:34 worker_base.py:464] File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
ERROR 09-20 01:48:34 worker_base.py:464] return func(*args, **kwargs)
ERROR 09-20 01:48:34 worker_base.py:464] ^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 worker_base.py:464] File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1654, in execute_model
ERROR 09-20 01:48:34 worker_base.py:464] hidden_or_intermediate_states = model_executable(**model_params)
ERROR 09-20 01:48:34 worker_base.py:464] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 worker_base.py:464] File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 09-20 01:48:34 worker_base.py:464] return self._call_impl(*args, **kwargs)
ERROR 09-20 01:48:34 worker_base.py:464] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 worker_base.py:464] File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 09-20 01:48:34 worker_base.py:464] return forward_call(*args, **kwargs)
ERROR 09-20 01:48:34 worker_base.py:464] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 worker_base.py:464] File "/home/tiger/.local/lib/python3.11/site-packages/vllm/model_executor/models/internlm2.py", line 332, in forward
ERROR 09-20 01:48:34 worker_base.py:464] hidden_states = self.model(input_ids, positions, kv_caches,
ERROR 09-20 01:48:34 worker_base.py:464] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 worker_base.py:464] File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 09-20 01:48:34 worker_base.py:464] return self._call_impl(*args, **kwargs)
ERROR 09-20 01:48:34 worker_base.py:464] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 worker_base.py:464] File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 09-20 01:48:34 worker_base.py:464] return forward_call(*args, **kwargs)
ERROR 09-20 01:48:34 worker_base.py:464] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 worker_base.py:464] File "/home/tiger/.local/lib/python3.11/site-packages/vllm/model_executor/models/internlm2.py", line 284, in forward
ERROR 09-20 01:48:34 worker_base.py:464] hidden_states, residual = layer(
ERROR 09-20 01:48:34 worker_base.py:464] ^^^^^^
ERROR 09-20 01:48:34 worker_base.py:464] File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 09-20 01:48:34 worker_base.py:464] return self._call_impl(*args, **kwargs)
ERROR 09-20 01:48:34 worker_base.py:464] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 worker_base.py:464] File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 09-20 01:48:34 worker_base.py:464] return forward_call(*args, **kwargs)
ERROR 09-20 01:48:34 worker_base.py:464] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 worker_base.py:464] File "/home/tiger/.local/lib/python3.11/site-packages/vllm/model_executor/models/internlm2.py", line 220, in forward
ERROR 09-20 01:48:34 worker_base.py:464] hidden_states = self.attention(
ERROR 09-20 01:48:34 worker_base.py:464] ^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 worker_base.py:464] File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 09-20 01:48:34 worker_base.py:464] return self._call_impl(*args, **kwargs)
ERROR 09-20 01:48:34 worker_base.py:464] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 worker_base.py:464] File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 09-20 01:48:34 worker_base.py:464] return forward_call(*args, **kwargs)
ERROR 09-20 01:48:34 worker_base.py:464] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 worker_base.py:464] File "/home/tiger/.local/lib/python3.11/site-packages/vllm/model_executor/models/internlm2.py", line 166, in forward
ERROR 09-20 01:48:34 worker_base.py:464] attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
ERROR 09-20 01:48:34 worker_base.py:464] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 worker_base.py:464] File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 09-20 01:48:34 worker_base.py:464] return self._call_impl(*args, **kwargs)
ERROR 09-20 01:48:34 worker_base.py:464] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 worker_base.py:464] File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 09-20 01:48:34 worker_base.py:464] return forward_call(*args, **kwargs)
ERROR 09-20 01:48:34 worker_base.py:464] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 worker_base.py:464] File "/home/tiger/.local/lib/python3.11/site-packages/vllm/attention/layer.py", line 98, in forward
ERROR 09-20 01:48:34 worker_base.py:464] return self.impl.forward(query,
ERROR 09-20 01:48:34 worker_base.py:464] ^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 worker_base.py:464] File "/home/tiger/.local/lib/python3.11/site-packages/vllm/attention/backends/flash_attn.py", line 702, in forward
ERROR 09-20 01:48:34 worker_base.py:464] assert key.shape[0] == num_prefill_tokens + num_decode_tokens
ERROR 09-20 01:48:34 worker_base.py:464] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 worker_base.py:464] AssertionError
ERROR 09-20 01:48:34 worker_base.py:464]
ERROR 09-20 01:48:34 worker_base.py:464] During handling of the above exception, another exception occurred:
ERROR 09-20 01:48:34 worker_base.py:464]
ERROR 09-20 01:48:34 worker_base.py:464] Traceback (most recent call last):
ERROR 09-20 01:48:34 worker_base.py:464] File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 456, in execute_method
ERROR 09-20 01:48:34 worker_base.py:464] return executor(*args, **kwargs)
ERROR 09-20 01:48:34 worker_base.py:464] ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 worker_base.py:464] File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 327, in execute_model
ERROR 09-20 01:48:34 worker_base.py:464] output = self.model_runner.execute_model(
ERROR 09-20 01:48:34 worker_base.py:464] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 worker_base.py:464] File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 09-20 01:48:34 worker_base.py:464] return func(*args, **kwargs)
ERROR 09-20 01:48:34 worker_base.py:464] ^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 worker_base.py:464] File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/model_runner_base.py", line 140, in _wrapper
ERROR 09-20 01:48:34 worker_base.py:464] pickle.dump(dumped_inputs, filep)
ERROR 09-20 01:48:34 worker_base.py:464] AttributeError: Can't pickle local object 'weak_bind.<locals>.weak_bound'
ERROR 09-20 01:48:34 async_llm_engine.py:61] Engine background task failed
ERROR 09-20 01:48:34 async_llm_engine.py:61] Traceback (most recent call last):
ERROR 09-20 01:48:34 async_llm_engine.py:61] File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
ERROR 09-20 01:48:34 async_llm_engine.py:61] return func(*args, **kwargs)
ERROR 09-20 01:48:34 async_llm_engine.py:61] ^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61] File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1654, in execute_model
ERROR 09-20 01:48:34 async_llm_engine.py:61] hidden_or_intermediate_states = model_executable(**model_params)
ERROR 09-20 01:48:34 async_llm_engine.py:61] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61] File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 09-20 01:48:34 async_llm_engine.py:61] return self._call_impl(*args, **kwargs)
ERROR 09-20 01:48:34 async_llm_engine.py:61] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61] File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 09-20 01:48:34 async_llm_engine.py:61] return forward_call(*args, **kwargs)
ERROR 09-20 01:48:34 async_llm_engine.py:61] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61] File "/home/tiger/.local/lib/python3.11/site-packages/vllm/model_executor/models/internlm2.py", line 332, in forward
ERROR 09-20 01:48:34 async_llm_engine.py:61] hidden_states = self.model(input_ids, positions, kv_caches,
ERROR 09-20 01:48:34 async_llm_engine.py:61] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61] File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 09-20 01:48:34 async_llm_engine.py:61] return self._call_impl(*args, **kwargs)
ERROR 09-20 01:48:34 async_llm_engine.py:61] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61] File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 09-20 01:48:34 async_llm_engine.py:61] return forward_call(*args, **kwargs)
ERROR 09-20 01:48:34 async_llm_engine.py:61] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61] File "/home/tiger/.local/lib/python3.11/site-packages/vllm/model_executor/models/internlm2.py", line 284, in forward
ERROR 09-20 01:48:34 async_llm_engine.py:61] hidden_states, residual = layer(
ERROR 09-20 01:48:34 async_llm_engine.py:61] ^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61] File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 09-20 01:48:34 async_llm_engine.py:61] return self._call_impl(*args, **kwargs)
ERROR 09-20 01:48:34 async_llm_engine.py:61] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61] File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 09-20 01:48:34 async_llm_engine.py:61] return forward_call(*args, **kwargs)
ERROR 09-20 01:48:34 async_llm_engine.py:61] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61] File "/home/tiger/.local/lib/python3.11/site-packages/vllm/model_executor/models/internlm2.py", line 220, in forward
ERROR 09-20 01:48:34 async_llm_engine.py:61] hidden_states = self.attention(
ERROR 09-20 01:48:34 async_llm_engine.py:61] ^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61] File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 09-20 01:48:34 async_llm_engine.py:61] return self._call_impl(*args, **kwargs)
ERROR 09-20 01:48:34 async_llm_engine.py:61] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61] File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 09-20 01:48:34 async_llm_engine.py:61] return forward_call(*args, **kwargs)
ERROR 09-20 01:48:34 async_llm_engine.py:61] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61] File "/home/tiger/.local/lib/python3.11/site-packages/vllm/model_executor/models/internlm2.py", line 166, in forward
ERROR 09-20 01:48:34 async_llm_engine.py:61] attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
ERROR 09-20 01:48:34 async_llm_engine.py:61] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61] File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 09-20 01:48:34 async_llm_engine.py:61] return self._call_impl(*args, **kwargs)
ERROR 09-20 01:48:34 async_llm_engine.py:61] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61] File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 09-20 01:48:34 async_llm_engine.py:61] return forward_call(*args, **kwargs)
ERROR 09-20 01:48:34 async_llm_engine.py:61] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61] File "/home/tiger/.local/lib/python3.11/site-packages/vllm/attention/layer.py", line 98, in forward
ERROR 09-20 01:48:34 async_llm_engine.py:61] return self.impl.forward(query,
ERROR 09-20 01:48:34 async_llm_engine.py:61] ^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61] File "/home/tiger/.local/lib/python3.11/site-packages/vllm/attention/backends/flash_attn.py", line 702, in forward
ERROR 09-20 01:48:34 async_llm_engine.py:61] assert key.shape[0] == num_prefill_tokens + num_decode_tokens
ERROR 09-20 01:48:34 async_llm_engine.py:61] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61] AssertionError
ERROR 09-20 01:48:34 async_llm_engine.py:61]
ERROR 09-20 01:48:34 async_llm_engine.py:61] During handling of the above exception, another exception occurred:
ERROR 09-20 01:48:34 async_llm_engine.py:61]
ERROR 09-20 01:48:34 async_llm_engine.py:61] Traceback (most recent call last):
ERROR 09-20 01:48:34 async_llm_engine.py:61] File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 51, in _log_task_completion
ERROR 09-20 01:48:34 async_llm_engine.py:61] return_value = task.result()
ERROR 09-20 01:48:34 async_llm_engine.py:61] ^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61] File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 755, in run_engine_loop
ERROR 09-20 01:48:34 async_llm_engine.py:61] result = task.result()
ERROR 09-20 01:48:34 async_llm_engine.py:61] ^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61] File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 678, in engine_step
ERROR 09-20 01:48:34 async_llm_engine.py:61] request_outputs = await self.engine.step_async(virtual_engine)
ERROR 09-20 01:48:34 async_llm_engine.py:61] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61] File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 343, in step_async
ERROR 09-20 01:48:34 async_llm_engine.py:61] outputs = await self.model_executor.execute_model_async(
ERROR 09-20 01:48:34 async_llm_engine.py:61] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61] File "/home/tiger/.local/lib/python3.11/site-packages/vllm/executor/ray_gpu_executor.py", line 523, in execute_model_async
ERROR 09-20 01:48:34 async_llm_engine.py:61] return await super().execute_model_async(execute_model_req)
ERROR 09-20 01:48:34 async_llm_engine.py:61] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61] File "/home/tiger/.local/lib/python3.11/site-packages/vllm/executor/distributed_gpu_executor.py", line 177, in execute_model_async
ERROR 09-20 01:48:34 async_llm_engine.py:61] return await self._driver_execute_model_async(execute_model_req)
ERROR 09-20 01:48:34 async_llm_engine.py:61] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61] File "/home/tiger/.local/lib/python3.11/site-packages/vllm/executor/ray_gpu_executor.py", line 540, in _driver_execute_model_async
ERROR 09-20 01:48:34 async_llm_engine.py:61] return await self.driver_exec_method("execute_model",
ERROR 09-20 01:48:34 async_llm_engine.py:61] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61] File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/concurrent/futures/thread.py", line 58, in run
ERROR 09-20 01:48:34 async_llm_engine.py:61] result = self.fn(*self.args, **self.kwargs)
ERROR 09-20 01:48:34 async_llm_engine.py:61] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61] File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 465, in execute_method
ERROR 09-20 01:48:34 async_llm_engine.py:61] raise e
ERROR 09-20 01:48:34 async_llm_engine.py:61] File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 456, in execute_method
ERROR 09-20 01:48:34 async_llm_engine.py:61] return executor(*args, **kwargs)
ERROR 09-20 01:48:34 async_llm_engine.py:61] ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61] File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 327, in execute_model
ERROR 09-20 01:48:34 async_llm_engine.py:61] output = self.model_runner.execute_model(
ERROR 09-20 01:48:34 async_llm_engine.py:61] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61] File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 09-20 01:48:34 async_llm_engine.py:61] return func(*args, **kwargs)
ERROR 09-20 01:48:34 async_llm_engine.py:61] ^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61] File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/model_runner_base.py", line 140, in _wrapper
ERROR 09-20 01:48:34 async_llm_engine.py:61] pickle.dump(dumped_inputs, filep)
ERROR 09-20 01:48:34 async_llm_engine.py:61] AttributeError: Can't pickle local object 'weak_bind.<locals>.weak_bound'
2024-09-20 01:48:34.319 base_events.py:1771 [ERROR]: Exception in callback _log_task_completion(error_callback=<bound method...7fb8dc6d3fd0>>)(<Task finishe...weak_bound'")>) at /home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py:41
handle: <Handle _log_task_completion(error_callback=<bound method...7fb8dc6d3fd0>>)(<Task finishe...weak_bound'")>) at /home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py:41>
Traceback (most recent call last):
File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1654, in execute_model
hidden_or_intermediate_states = model_executable(**model_params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.local/lib/python3.11/site-packages/vllm/model_executor/models/internlm2.py", line 332, in forward
hidden_states = self.model(input_ids, positions, kv_caches,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.local/lib/python3.11/site-packages/vllm/model_executor/models/internlm2.py", line 284, in forward
hidden_states, residual = layer(
^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.local/lib/python3.11/site-packages/vllm/model_executor/models/internlm2.py", line 220, in forward
hidden_states = self.attention(
^^^^^^^^^^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.local/lib/python3.11/site-packages/vllm/model_executor/models/internlm2.py", line 166, in forward
attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.local/lib/python3.11/site-packages/vllm/attention/layer.py", line 98, in forward
return self.impl.forward(query,
^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.local/lib/python3.11/site-packages/vllm/attention/backends/flash_attn.py", line 702, in forward
assert key.shape[0] == num_prefill_tokens + num_decode_tokens
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 51, in _log_task_completion
return_value = task.result()
^^^^^^^^^^^^^
File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 755, in run_engine_loop
result = task.result()
^^^^^^^^^^^^^
File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 678, in engine_step
request_outputs = await self.engine.step_async(virtual_engine)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 343, in step_async
outputs = await self.model_executor.execute_model_async(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.local/lib/python3.11/site-packages/vllm/executor/ray_gpu_executor.py", line 523, in execute_model_async
return await super().execute_model_async(execute_model_req)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.local/lib/python3.11/site-packages/vllm/executor/distributed_gpu_executor.py", line 177, in execute_model_async
return await self._driver_execute_model_async(execute_model_req)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.local/lib/python3.11/site-packages/vllm/executor/ray_gpu_executor.py", line 540, in _driver_execute_model_async
return await self.driver_exec_method("execute_model",
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 465, in execute_method
raise e
File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 456, in execute_method
return executor(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 327, in execute_model
output = self.model_runner.execute_model(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/model_runner_base.py", line 140, in _wrapper
pickle.dump(dumped_inputs, filep)
AttributeError: Can't pickle local object 'weak_bind.<locals>.weak_bound'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 63, in _log_task_completion
raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
Traceback (most recent call last):
File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1654, in execute_model
hidden_or_intermediate_states = model_executable(**model_params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.local/lib/python3.11/site-packages/vllm/model_executor/models/internlm2.py", line 332, in forward
hidden_states = self.model(input_ids, positions, kv_caches,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.local/lib/python3.11/site-packages/vllm/model_executor/models/internlm2.py", line 284, in forward
hidden_states, residual = layer(
^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.local/lib/python3.11/site-packages/vllm/model_executor/models/internlm2.py", line 220, in forward
hidden_states = self.attention(
^^^^^^^^^^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.local/lib/python3.11/site-packages/vllm/model_executor/models/internlm2.py", line 166, in forward
attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.local/lib/python3.11/site-packages/vllm/attention/layer.py", line 98, in forward
return self.impl.forward(query,
^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.local/lib/python3.11/site-packages/vllm/attention/backends/flash_attn.py", line 702, in forward
assert key.shape[0] == num_prefill_tokens + num_decode_tokens
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 882, in generate
async for output in await self.add_request(
File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 109, in generator
raise result
File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 51, in _log_task_completion
return_value = task.result()
^^^^^^^^^^^^^
File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 755, in run_engine_loop
result = task.result()
^^^^^^^^^^^^^
File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 678, in engine_step
request_outputs = await self.engine.step_async(virtual_engine)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 343, in step_async
outputs = await self.model_executor.execute_model_async(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.local/lib/python3.11/site-packages/vllm/executor/ray_gpu_executor.py", line 523, in execute_model_async
return await super().execute_model_async(execute_model_req)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.local/lib/python3.11/site-packages/vllm/executor/distributed_gpu_executor.py", line 177, in execute_model_async
return await self._driver_execute_model_async(execute_model_req)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.local/lib/python3.11/site-packages/vllm/executor/ray_gpu_executor.py", line 540, in _driver_execute_model_async
return await self.driver_exec_method("execute_model",
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 465, in execute_method
raise e
File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 456, in execute_method
return executor(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 327, in execute_model
output = self.model_runner.execute_model(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/model_runner_base.py", line 140, in _wrapper
pickle.dump(dumped_inputs, filep)
AttributeError: Can't pickle local object 'weak_bind.<locals>.weak_bound'
^^^^^^^^^^^^^ ERROR 09-20 01:48:34 async_llm_engine.py:61] File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 755, in run_engine_loop ERROR 09-20 01:48:34 async_llm_engine.py:61] result = task.result() ERROR 09-20 01:48:34 async_llm_engine.py:61] ^^^^^^^^^^^^^ ERROR 09-20 01:48:34 async_llm_engine.py:61] File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 678, in engine_step ERROR 09-20 01:48:34 async_llm_engine.py:61] request_outputs = await self.engine.step_async(virtual_engine) ERROR 09-20 01:48:34 async_llm_engine.py:61] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 09-20 01:48:34 async_llm_engine.py:61] File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 343, in step_async ERROR 09-20 01:48:34 async_llm_engine.py:61] outputs = await self.model_executor.execute_model_async( ERROR 09-20 01:48:34 async_llm_engine.py:61] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 09-20 01:48:34 async_llm_engine.py:61] File "/home/tiger/.local/lib/python3.11/site-packages/vllm/executor/ray_gpu_executor.py", line 523, in execute_model_async ERROR 09-20 01:48:34 async_llm_engine.py:61] return await super().execute_model_async(execute_model_req) ERROR 09-20 01:48:34 async_llm_engine.py:61] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 09-20 01:48:34 async_llm_engine.py:61] File "/home/tiger/.local/lib/python3.11/site-packages/vllm/executor/distributed_gpu_executor.py", line 177, in execute_model_async ERROR 09-20 01:48:34 async_llm_engine.py:61] return await self._driver_execute_model_async(execute_model_req) ERROR 09-20 01:48:34 async_llm_engine.py:61] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 09-20 01:48:34 async_llm_engine.py:61] File "/home/tiger/.local/lib/python3.11/site-packages/vllm/executor/ray_gpu_executor.py", line 540, in _driver_execute_model_async ERROR 09-20 01:48:34 async_llm_engine.py:61] return await self.driver_exec_method("execute_model", ERROR 09-20 01:48:34 async_llm_engine.py:61] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 09-20 01:48:34 async_llm_engine.py:61] File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/concurrent/futures/thread.py", line 58, in run ERROR 09-20 01:48:34 async_llm_engine.py:61] result = self.fn(*self.args, **self.kwargs) ERROR 09-20 01:48:34 async_llm_engine.py:61] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 09-20 01:48:34 async_llm_engine.py:61] File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 465, in execute_method ERROR 09-20 01:48:34 async_llm_engine.py:61] raise e ERROR 09-20 01:48:34 async_llm_engine.py:61] File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 456, in execute_method ERROR 09-20 01:48:34 async_llm_engine.py:61] return executor(*args, **kwargs) ERROR 09-20 01:48:34 async_llm_engine.py:61] ^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 09-20 01:48:34 async_llm_engine.py:61] File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 327, in execute_model ERROR 09-20 01:48:34 async_llm_engine.py:61] output = self.model_runner.execute_model( ERROR 09-20 01:48:34 async_llm_engine.py:61] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 09-20 01:48:34 async_llm_engine.py:61] File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context ERROR 09-20 01:48:34 async_llm_engine.py:61] return func(*args, **kwargs) ERROR 09-20 
01:48:34 async_llm_engine.py:61] ^^^^^^^^^^^^^^^^^^^^^ ERROR 09-20 01:48:34 async_llm_engine.py:61] File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/model_runner_base.py", line 140, in _wrapper ERROR 09-20 01:48:34 async_llm_engine.py:61] pickle.dump(dumped_inputs, filep) ERROR 09-20 01:48:34 async_llm_engine.py:61] AttributeError: Can't pickle local object 'weak_bind.<locals>.weak_bound' 2024-09-20 01:48:34.319 base_events.py:1771 [ERROR]: Exception in callback _log_task_completion(error_callback=<bound method...7fb8dc6d3fd0>>)(<Task finishe...weak_bound'")>) at /home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py:41 handle: <Handle _log_task_completion(error_callback=<bound method...7fb8dc6d3fd0>>)(<Task finishe...weak_bound'")>) at /home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py:41> Traceback (most recent call last): File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper return func(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^ File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1654, in execute_model hidden_or_intermediate_states = model_executable(**model_params) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl return self._call_impl(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl return forward_call(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/tiger/.local/lib/python3.11/site-packages/vllm/model_executor/models/internlm2.py", line 332, in forward hidden_states = self.model(input_ids, positions, kv_caches, ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl return self._call_impl(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl return forward_call(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/tiger/.local/lib/python3.11/site-packages/vllm/model_executor/models/internlm2.py", line 284, in forward hidden_states, residual = layer( ^^^^^^ File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl return self._call_impl(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl return forward_call(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/tiger/.local/lib/python3.11/site-packages/vllm/model_executor/models/internlm2.py", line 220, in forward hidden_states = self.attention( ^^^^^^^^^^^^^^^ File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl return self._call_impl(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl return forward_call(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/tiger/.local/lib/python3.11/site-packages/vllm/model_executor/models/internlm2.py", line 166, in forward attn_output = self.attn(q, k, 
v, kv_cache, attn_metadata) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl return self._call_impl(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl return forward_call(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/tiger/.local/lib/python3.11/site-packages/vllm/attention/layer.py", line 98, in forward return self.impl.forward(query, ^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/tiger/.local/lib/python3.11/site-packages/vllm/attention/backends/flash_attn.py", line 702, in forward assert key.shape[0] == num_prefill_tokens + num_decode_tokens ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ AssertionError During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 51, in _log_task_completion return_value = task.result() ^^^^^^^^^^^^^ File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 755, in run_engine_loop result = task.result() ^^^^^^^^^^^^^ File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 678, in engine_step request_outputs = await self.engine.step_async(virtual_engine) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 343, in step_async outputs = await self.model_executor.execute_model_async( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/tiger/.local/lib/python3.11/site-packages/vllm/executor/ray_gpu_executor.py", line 523, in execute_model_async return await super().execute_model_async(execute_model_req) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/tiger/.local/lib/python3.11/site-packages/vllm/executor/distributed_gpu_executor.py", line 177, in execute_model_async return await self._driver_execute_model_async(execute_model_req) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/tiger/.local/lib/python3.11/site-packages/vllm/executor/ray_gpu_executor.py", line 540, in _driver_execute_model_async return await self.driver_exec_method("execute_model", ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/concurrent/futures/thread.py", line 58, in run result = self.fn(*self.args, **self.kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 465, in execute_method raise e File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 456, in execute_method return executor(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 327, in execute_model output = self.model_runner.execute_model( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context return func(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^ File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/model_runner_base.py", line 140, in _wrapper pickle.dump(dumped_inputs, filep) AttributeError: Can't pickle local object 'weak_bind.<locals>.weak_bound' The above exception was the direct cause of the 
following exception: Traceback (most recent call last): File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/asyncio/events.py", line 80, in _run self._context.run(self._callback, *self._args) File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 63, in _log_task_completion raise AsyncEngineDeadError( vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause. Traceback (most recent call last): File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper return func(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^ File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1654, in execute_model hidden_or_intermediate_states = model_executable(**model_params) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl return self._call_impl(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl return forward_call(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/tiger/.local/lib/python3.11/site-packages/vllm/model_executor/models/internlm2.py", line 332, in forward hidden_states = self.model(input_ids, positions, kv_caches, ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl return self._call_impl(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl return forward_call(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/tiger/.local/lib/python3.11/site-packages/vllm/model_executor/models/internlm2.py", line 284, in forward hidden_states, residual = layer( ^^^^^^ File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl return self._call_impl(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl return forward_call(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/tiger/.local/lib/python3.11/site-packages/vllm/model_executor/models/internlm2.py", line 220, in forward hidden_states = self.attention( ^^^^^^^^^^^^^^^ File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl return self._call_impl(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl return forward_call(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/tiger/.local/lib/python3.11/site-packages/vllm/model_executor/models/internlm2.py", line 166, in forward attn_output = self.attn(q, k, v, kv_cache, attn_metadata) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl return self._call_impl(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File 
"/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl return forward_call(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/tiger/.local/lib/python3.11/site-packages/vllm/attention/layer.py", line 98, in forward return self.impl.forward(query, ^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/tiger/.local/lib/python3.11/site-packages/vllm/attention/backends/flash_attn.py", line 702, in forward assert key.shape[0] == num_prefill_tokens + num_decode_tokens ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ AssertionError During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 882, in generate async for output in await self.add_request( File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 109, in generator raise result File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 51, in _log_task_completion return_value = task.result() ^^^^^^^^^^^^^ File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 755, in run_engine_loop result = task.result() ^^^^^^^^^^^^^ File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 678, in engine_step request_outputs = await self.engine.step_async(virtual_engine) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 343, in step_async outputs = await self.model_executor.execute_model_async( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/tiger/.local/lib/python3.11/site-packages/vllm/executor/ray_gpu_executor.py", line 523, in execute_model_async return await super().execute_model_async(execute_model_req) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/tiger/.local/lib/python3.11/site-packages/vllm/executor/distributed_gpu_executor.py", line 177, in execute_model_async return await self._driver_execute_model_async(execute_model_req) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/tiger/.local/lib/python3.11/site-packages/vllm/executor/ray_gpu_executor.py", line 540, in _driver_execute_model_async return await self.driver_exec_method("execute_model", ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/concurrent/futures/thread.py", line 58, in run result = self.fn(*self.args, **self.kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 465, in execute_method raise e File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 456, in execute_method return executor(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 327, in execute_model output = self.model_runner.execute_model( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context return func(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^ File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/model_runner_base.py", line 140, in _wrapper pickle.dump(dumped_inputs, filep) AttributeError: Can't pickle local object 'weak_bind.<locals>.weak_bound'
Are you using speculative decoding? It's not supported with input embeds yet.
No, here is my config
AsyncEngineArgs(model='./internlm2-chat-7b', served_model_name=None, tokenizer='./internlm2-chat-7b', skip_tokenizer_init=False, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', config_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, seed=0, max_model_len=None, worker_use_ray=True, distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, revision=None, code_revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, quantization=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, enable_lora=False, max_loras=1, max_lora_rank=16, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, fully_sharded_loras=False, lora_extra_vocab_size=256, long_lora_scaling_factors=None, lora_dtype='auto', max_cpu_loras=None, device='auto', num_scheduler_steps=1, ray_workers_use_nsight=False, num_gpu_blocks_override=None, num_lookahead_slots=0, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, scheduler_delay_factor=0.0, enable_chunked_prefill=None, guided_decoding_backend='outlines', speculative_model=None, speculative_model_quantization=None, speculative_draft_tensor_parallel_size=None, num_speculative_tokens=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, qlora_adapter_name_or_path=None, disable_logprobs_during_spec_decoding=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, disable_log_requests=True)
I checked this out using gh pr checkout 6869 on the latest vLLM, and it looks like there's a bug: input processing is broken. When I add
print(f'inputs {inputs}\n preprocessed_inputs {preprocessed_inputs} \n processed_inputs {processed_inputs}')
right before the self._add_processed_request call (vllm/engine/llm_engine.py:749), I get this output (tensor contents removed for readability):
inputs {'prompt_embeds': tensor([[ 0.0024, 0.0022]],
grad_fn=<EmbeddingBackward0>)}
preprocessed_inputs {'prompt_token_ids': [], 'prompt': None, 'prompt_embeds': tensor([[ 0.0024, 0.0022]],
grad_fn=<EmbeddingBackward0>), 'multi_modal_data': None}
processed_inputs {'prompt_token_ids': [], 'prompt': None, 'prompt_embeds': tensor([[ 0.0024, 0.0022]],
grad_fn=<EmbeddingBackward0>), 'multi_modal_data': None}
As you can see, prompt_token_ids is no longer None, and it fails at LLMEngine._validate_model_inputs with ValueError: You can only provide either tokens or embeddings, not both
I tried to fix that, but since I'm not sure how this all works, I didn't succeed. Ignoring input validation and setting processed_inputs['prompt_token_ids'] = [] causes a RuntimeError: CUDA error: device-side assert triggered, and setting it to None causes an error at vllm/sequence.py:418:
SequenceData(
array(VLLM_TOKEN_ID_ARRAY_TYPE, self.prompt_token_ids),
self.prompt_embeds)
TypeError: 'NoneType' object is not iterable
I tried two different models, both using LlamaForCausalLM, and both work fine in 'vanilla' vLLM. I tried running without any sampling parameters as well as without quantization, but never managed to get it working.
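For what it's worth, the final TypeError above can be reproduced outside vLLM: array() tries to iterate over its initializer, so passing None for prompt_token_ids fails in exactly this way. A minimal sketch, with the "l" typecode as an assumption standing in for VLLM_TOKEN_ID_ARRAY_TYPE:
# Minimal sketch reproducing the TypeError reported above, independent of vLLM.
from array import array

prompt_token_ids = None  # what the experiment above set it to
try:
    array("l", prompt_token_ids)  # array() iterates over its initializer
except TypeError as err:
    print(err)  # 'NoneType' object is not iterable

array("l", [])  # an empty list is accepted, but then downstream checks see a zero-length prompt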
Yeah, it's bugged now. Going to resume work on this later today. Sorry for breaking this!
I found some deeper issues with how the inputs are being processed right now and am working on a refactor.
I have finally fixed input processor to work with embedding inputs. To reduce the scope of this PR, I'll split out some of the changes (in particular the renaming) to other PRs.
thank you @DarkLight1337! Is there anything else you would like me to do for this PR?
Let's wait for my other PRs to be merged first.
I know that this is still in the works, but I tried it before and after the recent merges, and both times I got errors with more or less the same content, so I wanted to report an issue with this PR (or please tell me if I'm doing something wrong).
What I was doing:
import gc
from pathlib import Path

import vllm.vllm as vllm
from vllm.vllm import LLM, SamplingParams
from transformers import AutoTokenizer, LlamaForCausalLM
import copy

vicuna_path = Path("...")
llm = LLM(model=vicuna_path, max_model_len=1500)
tokenizer = AutoTokenizer.from_pretrained(vicuna_path)

# embeddings generation
model = LlamaForCausalLM.from_pretrained(vicuna_path)
embeddings = copy.deepcopy(model.model.embed_tokens)
del model
gc.collect()

message = "USER: Count from 1 to 10 please \nASSISTANT:"
input_ids = tokenizer(message, return_tensors="pt")
embed_tokens = embeddings(input_ids['input_ids'])
embed_tokens.shape  # torch.Size([1, 19, 4096])

# skipped sampling params here for simplicity
outputs = llm.generate({'prompt_embeds': embed_tokens})
I get RuntimeError: CUDA error: an illegal memory access was encountered, which is probably caused by:
File /media/data/agafonov/repos/allm_service/vllm/vllm/attention/backends/flash_attn.py:682, in FlashAttentionImpl.forward(self, query, key, value, kv_cache, attn_metadata, k_scale, v_scale, attn_type)
679 assert k_scale == 1.0 and v_scale == 1.0, (
680 "key/v_scale is not supported in FlashAttention.")
--> 682 num_tokens, hidden_size = query.shape
683 # Reshape the query, key, and value tensors.
ValueError: too many values to unpack (expected 2)
(end of stack trace)
After the error I can't use that GPU until I reload my Jupyter kernel.
vLLM: latest main (0.6.2) with gh pr checkout 6869; hardware: 4060 Ti, Linux, CUDA 12.1. Regular vLLM usage after checking out this PR works fine as far as I can see.
Oh, I found the mistake I made. Basically, this PR expects embeds as a tensor without a batch dimension, but the transformers LLM uses batched input.
print(f'embeddings shape: {embeds.shape}')
output = self.llama_model.generate(
    inputs_embeds=embeds, ...
)
# embeddings shape: torch.Size([1, 112, 4096])
So it works with (batch_len, tokens, emb_dim) tensors, while your code expects each tensor to have shape (tokens, emb_dim). I added torch.squeeze(embed_tokens) and it worked with vLLM.
Not sure if that needs to be fixed or not, but for people like me, when you merge this please add a note in the docs that no batch dimension is expected (:
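To make the shape expectation concrete, here is a hedged sketch of the workaround described above (model_path is a placeholder; the prompt_embeds key is the one this PR adds):
# Sketch: HF embedding layers return (batch, seq_len, hidden_size); this PR expects (seq_len, hidden_size).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/model"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16)
embed_layer = model.get_input_embeddings()

input_ids = tokenizer("USER: Count from 1 to 10 please \nASSISTANT:", return_tensors="pt")["input_ids"]
embeds = embed_layer(input_ids)      # shape (1, seq_len, hidden_size)
embeds = embeds.squeeze(0).detach()  # shape (seq_len, hidden_size), as expected here

# llm.generate({"prompt_embeds": embeds}, sampling_params)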
Sorry for the late response. I'll certainly add a check for that!
Update: Done!
@DarkLight1337 Is there any update on this PR? I'm very excited to use this feature.
Thanks for your interest. I'm still waiting for #8688 and a subsequent PR that refactors input processing to be merged first.
I tried the newest version of this PR using gh pr checkout 6869, but it seems to fail:
[rank0]: Traceback (most recent call last):
[rank0]: File "internlm2/embed_vllm.py.py", line 45, in <module>
[rank0]: File "vllm-master/vllm/vllm/utils.py", line 1051, in inner
[rank0]: return fn(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^
[rank0]: File "vllm-master/vllm/vllm/entrypoints/llm.py", line 391, in generate
[rank0]: outputs = self._run_engine(use_tqdm=use_tqdm)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "vllm-master/vllm/vllm/entrypoints/llm.py", line 899, in _run_engine
[rank0]: step_outputs = self.llm_engine.step()
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "vllm-master/vllm/vllm/engine/llm_engine.py", line 1356, in step
[rank0]: ) = self.scheduler[virtual_engine].schedule()
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "vllm-master/vllm/vllm/core/scheduler.py", line 1218, in schedule
[rank0]: scheduler_outputs: SchedulerOutputs = self._schedule()
[rank0]: ^^^^^^^^^^^^^^^^
[rank0]: File "vllm-master/vllm/vllm/core/scheduler.py", line 1178, in _schedule
[rank0]: return self._schedule_default()
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "vllm-master/vllm/vllm/core/scheduler.py", line 1013, in _schedule_default
[rank0]: prefills = self._schedule_prefills(budget,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "vllm-master/vllm/vllm/core/scheduler.py", line 949, in _schedule_prefills
[rank0]: self._allocate_and_set_running(seq_group)
[rank0]: File "vllm-master/vllm/vllm/core/scheduler.py", line 1411, in _allocate_and_set_running
[rank0]: self.block_manager.allocate(seq_group)
[rank0]: File "vllm-master/vllm/vllm/core/block_manager_v2.py", line 169, in allocate
[rank0]: block_table: BlockTable = self._allocate_sequence(seq)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "vllm-master/vllm/vllm/core/block_manager_v2.py", line 155, in _allocate_sequence
[rank0]: block_table.allocate(seq.get_token_ids())
[rank0]: File "vllm-master/vllm/vllm/core/block/block_table.py", line 95, in allocate
[rank0]: assert token_ids
[rank0]: AssertionError
We will likely need another pass to get this PR to work with blockmanagerv2. I'm still waiting for previous PRs to be merged.
@qthequartermasterman Sorry for the delay, ~~I think I have fixed the breaking issues now~~. #9604 is the last PR to be merged before we can move forward with this PR.
I'm still unable to pass the assertion of num_new_tokens > 0. For input embeddings, there are no token inputs to the model, so we may need to update the existing code to accommodate this. However this will likely interfere with the current re-arch efforts. @WoosukKwon @comaniac any suggestions?
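To illustrate the kind of accommodation that might be needed (a purely hypothetical sketch, not code from this PR or from vLLM): the prompt length for embeds-only inputs could be derived from the embedding tensor's first dimension instead of the token list.
# Hypothetical sketch only: how a length check could handle embeds-only prompts,
# so that assertions like num_new_tokens > 0 don't trip on an empty token list.
from typing import Optional, Sequence
import torch

def prompt_length(prompt_token_ids: Optional[Sequence[int]],
                  prompt_embeds: Optional[torch.Tensor]) -> int:
    if prompt_token_ids:                 # ordinary text/token prompts
        return len(prompt_token_ids)
    if prompt_embeds is not None:        # embeds-only prompts: one position per embedding row
        return prompt_embeds.shape[0]
    raise ValueError("need either prompt_token_ids or prompt_embeds")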
@DarkLight1337 @Nan2018 Thanks guys, this is an essential PR. Could you make it supported by the OpenAI-compatible endpoint, so that it's convenient to call via the RESTful API?
Is this an existing endpoint from OpenAI? What would the schema look like in this case?
Yes, I mean this API: https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/openai/api_server.py
And the payload could look like this:
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama-3.1",
"prompt_embeds": [[0.1,0.2,0.14,...]], # (Length, Dim) shape instead of prompt
"max_tokens": 7,
"temperature": 0
}'
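For comparison, the same hypothetical payload sent from Python (the endpoint and the prompt_embeds field are what is being proposed above, not something vLLM supports yet; the embedding values are made up):
# Hypothetical client call mirroring the curl example above, with prompt_embeds as plain floats.
import requests

embeds = [[0.1, 0.2, 0.14], [0.05, -0.3, 0.2]]  # (seq_len, hidden_size); illustrative values only
response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "meta-llama-3.1",
        "prompt_embeds": embeds,
        "max_tokens": 7,
        "temperature": 0,
    },
)
print(response.json())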
I see, thanks for the example. We'll consider this after this PR is merged.
@toilaluan I have a branch for sending prompt embeddings to the OpenAI completions endpoint. Instead of plain text, I used base64 encoding.
import io
from base64 import b64encode

import requests
import torch

prompt_embeds = []
for input_embeds in inputs_embeds:  # inputs_embeds is a list of embeddings of shape (seq_len, hidden_size); seq_len can differ
    buff = io.BytesIO()
    torch.save(input_embeds.detach().cpu(), buff)
    prompt_embeds.append(b64encode(buff.getvalue()).decode("utf-8"))

response = requests.post(
    url,
    json={
        "model": "llama31",
        "prompt_embeds": prompt_embeds,
    },
).json()
@DarkLight1337 if we make it part of the openai api, is text like [[0.1,0.2,0.14,...]] preferred over b64 encoding?
OpenAI's Embeddings endpoint supports returning both floats and base64, so I think it's reasonable to expect that users can pass the embeddings in either form.
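A hedged sketch of what accepting both forms could look like on the serving side (none of these names come from vLLM; the base64 format matches the torch.save round-trip in the client snippet above):
# Hypothetical helper: accept prompt_embeds either as nested float lists of shape
# (seq_len, hidden_size) or as a base64 string containing a torch.save'd tensor.
import io
from base64 import b64decode
from typing import List, Union

import torch

def parse_prompt_embeds(value: Union[str, List[List[float]]]) -> torch.Tensor:
    if isinstance(value, str):
        return torch.load(io.BytesIO(b64decode(value)), map_location="cpu")
    return torch.tensor(value, dtype=torch.float32)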