
[Core] generate from input embeds

Nan2018 opened this pull request 1 year ago • 7 comments

Adds support for passing prompt_embeds to LLM.generate as

llm.generate({"prompt_embeds": input_embeds}, sampling_params)

or

llm.generate(
    [{"prompt_embeds": input_embeds} for input_embeds in inputs_embeds], sampling_params
)

This enables use cases where only the embedding layer is fine-tuned, allowing the same model backend to serve multiple custom-tuned embedding layers.

FIX #416 FIX #8323

Inspired by https://github.com/vllm-project/vllm/pull/1265, which is now very outdated.
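
For context, here is a minimal sketch of the intended workflow: computing prompt embeddings with a (possibly fine-tuned) HF embedding layer and passing them through the prompt_embeds input added by this PR. The checkpoint path and sampling settings are placeholders, not part of the PR:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from vllm import LLM, SamplingParams

# placeholder checkpoint; in the real use case the embedding layer below
# would come from a custom fine-tuned model
model_path = "mistralai/Mistral-7B-Instruct-v0.2"

tokenizer = AutoTokenizer.from_pretrained(model_path)
hf_model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16)
embed = hf_model.get_input_embeddings()

llm = LLM(model=model_path)
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32)

prompt = "Hello, my name is"
input_ids = tokenizer.encode(prompt, return_tensors="pt")
with torch.no_grad():
    prompt_embeds = embed(input_ids).squeeze(0)  # shape: (seq_len, hidden_size)

# prompt_embeds input format proposed by this PR
outputs = llm.generate({"prompt_embeds": prompt_embeds}, sampling_params)
print(outputs[0].outputs[0].text)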

Nan2018 · Jul 27 '24

👋 Hi! Thank you for contributing to the vLLM project. Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which consists of a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build on the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add ready label to the PR
  • Enable auto-merge.

🚀

github-actions[bot] · Jul 27 '24

@WoosukKwon @ywang96 @robertgshaw2-neuralmagic

The failed tests with ValueError: Cannot use apply_chat_template() because tokenizer.chat_template is not set and no template argument was passed! For information about writing templates and setting the tokenizer.chat_template attribute, please see the documentation at https://huggingface.co/docs/transformers/main/en/chat_templating seem unrelated to my changes, and I can't reproduce them locally.

other than that, this is ready for review

Nan2018 · Aug 08 '24

> @WoosukKwon @ywang96 @robertgshaw2-neuralmagic
>
> The failed tests with ValueError: Cannot use apply_chat_template() because tokenizer.chat_template is not set and no template argument was passed! For information about writing templates and setting the tokenizer.chat_template attribute, please see the documentation at https://huggingface.co/docs/transformers/main/en/chat_templating seem unrelated to my changes, and I can't reproduce them locally.
>
> Other than that, this is ready for review.

This is due to a recent change in transformers that deprecated the default chat template; it should have been fixed by https://github.com/vllm-project/vllm/pull/7238. Can you merge your branch with main again?

ywang96 · Aug 08 '24

ready for review @ywang96 @WoosukKwon @robertgshaw2-neuralmagic

Nan2018 · Sep 06 '24

Script I ran for testing (notebook-style cells; the async branches rely on top-level await):


# %%
ASYNC = True  # use AsyncLLMEngine instead of the synchronous LLM entrypoint
USE_RAY = False  # distributed executor backend: 'ray' if True, else 'mp'

# %%
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

device = "cpu"
model_path = "/models/huggingface/mistralai/Mistral-7B-Instruct-v0.2"  # "/models/huggingface/meta-llama/llama-2-7b-chat-hf"
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map=device, torch_dtype=torch.bfloat16
).eval()
tokenizer = AutoTokenizer.from_pretrained(model_path)
embed = model.get_input_embeddings()
if not tokenizer.pad_token:
    if tokenizer.eos_token:
        tokenizer.add_special_tokens(
            {
                "pad_token": tokenizer.eos_token,
            }
        )
    else:
        tokenizer.add_special_tokens({"pad_token": "<PAD>"})
        model.resize_token_embeddings(model.config.vocab_size + 1)

# %%
from vllm import LLM, SamplingParams, AsyncLLMEngine, AsyncEngineArgs
from pprint import pprint

if ASYNC:
    llm = AsyncLLMEngine.from_engine_args(
        AsyncEngineArgs(
            model="/models/huggingface/mistralai/Mistral-7B-Instruct-v0.2",
            tensor_parallel_size=4,
            distributed_executor_backend='ray' if USE_RAY else 'mp',
        )
    )
else:
    llm = LLM(
        model="/models/huggingface/mistralai/Mistral-7B-Instruct-v0.2",
        distributed_executor_backend='ray' if USE_RAY else 'mp',
        tensor_parallel_size=4,
    )

# %%
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "My favorite book of all time is",
    "When I wake up in the morning, the first thing I do is",
    "The best vacation I ever took was to",
    "If I could have any superpower, it would be",
    "One thing on my bucket list is",
    "My go-to comfort food is",
    "The most inspiring person in history, in my opinion, is",
    "When I was a child, I wanted to grow up to be",
    "If I could live in any period of history, I would choose",
    "A skill I've always wanted to learn but haven't yet is",
    "The most memorable movie quote for me is",
    "In my free time, I like to",
    "If I could meet any fictional character, I would choose",
    "The last dream I remember having was about",
    "My favorite season of the year is",
    "One thing I wish more people knew about me is",
    "The most delicious meal I've ever had was",
    "If I could travel anywhere in the world right now, I would go to",
    "A hobby I have that most people don't know about is",
    "The best piece of advice I've ever received is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=10)

# %%
# compute prompt embeddings with the model's input embedding layer
inputs_embeds = []
for prompt in prompts:
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        input_embeds = embed(input_ids).squeeze(0)
    pprint(input_embeds.shape)
    inputs_embeds.append(input_embeds)

# %%
from uuid import uuid4

if ASYNC:
    results = []
    for prompt in prompts:
        async for request_output in llm.generate(prompt, sampling_params, str(uuid4())):
            pass
        results.append(request_output)
else:
    results = llm.generate(prompts, sampling_params)
pprint(results)

# %%
if ASYNC:
    async for request_output in llm.generate({"prompt_embeds": inputs_embeds[0]}, sampling_params, str(uuid4())):
        results = request_output
else:
    results = llm.generate({"prompt_embeds": inputs_embeds[0]}, sampling_params)
pprint(results)

# %%
if ASYNC:
    results = []
    for input_embeds in inputs_embeds:
        async for request_output in llm.generate({"prompt_embeds": input_embeds}, sampling_params, str(uuid4())):
            pass
        results.append(request_output)
else:
    results = llm.generate(
        [{"prompt_embeds": input_embeds} for input_embeds in inputs_embeds], sampling_params
    )
pprint(results)

# %%
from random import shuffle

mixed_inputs = [{"prompt_embeds": input_embeds} for input_embeds in inputs_embeds[:10]] + prompts[10:]
shuffle(mixed_inputs)
if ASYNC:
    results = []
    for mixed_input in mixed_inputs:
        async for request_output in llm.generate(mixed_input, sampling_params, str(uuid4())):
            pass
        results.append(request_output)
else:
    results = llm.generate(
        mixed_inputs,
        sampling_params,
    )
pprint(results)

Nan2018 · Sep 06 '24

Thanks for the great work. I tried this PR as below; my model is a self-defined LlamaForCausalLM.

outputs = llm.generate({"prompt_embeds": input_token_embedding}, 
                       sampling_params)

However, I got the following error:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[7], line 1
----> 1 outputs = llm.generate({"prompt_embeds": input_token_embedding}, 
      2                        sampling_params)

File ~/anaconda3/envs/demo_tts/lib/python3.10/site-packages/vllm/utils.py:1032, in deprecate_kwargs.<locals>.wrapper.<locals>.inner(*args, **kwargs)
   1025             msg += f" {additional_message}"
   1027         warnings.warn(
   1028             DeprecationWarning(msg),
   1029             stacklevel=3,  # The inner function takes up one level
   1030         )
-> 1032 return fn(*args, **kwargs)

File ~/anaconda3/envs/demo_tts/lib/python3.10/site-packages/vllm/entrypoints/llm.py:347, in LLM.generate(self, prompts, sampling_params, prompt_token_ids, use_tqdm, lora_request, prompt_adapter_request, guided_options_request)
    338     sampling_params = SamplingParams()
    340 self._validate_and_add_requests(
    341     inputs=inputs,
    342     params=sampling_params,
    343     lora_request=lora_request,
    344     prompt_adapter_request=prompt_adapter_request,
    345     guided_options=guided_options_request)
--> 347 outputs = self._run_engine(use_tqdm=use_tqdm)
    348 return LLMEngine.validate_outputs(outputs, RequestOutput)

File ~/anaconda3/envs/demo_tts/lib/python3.10/site-packages/vllm/entrypoints/llm.py:704, in LLM._run_engine(self, use_tqdm)
    702 total_out_toks = 0
    703 while self.llm_engine.has_unfinished_requests():
--> 704     step_outputs = self.llm_engine.step()
    705     for output in step_outputs:
    706         if output.finished:

File ~/anaconda3/envs/demo_tts/lib/python3.10/site-packages/vllm/engine/llm_engine.py:1570, in LLMEngine.step(self)
   1566 if allow_async_output_proc:
   1567     execute_model_req.async_callback = self.async_callbacks[
   1568         virtual_engine]
-> 1570 output = self.model_executor.execute_model(
   1571     execute_model_req=execute_model_req)
   1573 # We need to do this here so that last step's sampled_token_ids can
   1574 # be passed to the next iteration for PP.
   1575 if self.scheduler_config.is_multi_step:

File ~/anaconda3/envs/demo_tts/lib/python3.10/site-packages/vllm/executor/gpu_executor.py:130, in GPUExecutor.execute_model(self, execute_model_req)
    127 def execute_model(
    128     self, execute_model_req: ExecuteModelRequest
    129 ) -> Optional[List[Union[SamplerOutput, PoolerOutput]]]:
--> 130     output = self.driver_worker.execute_model(execute_model_req)
    131     return output

File ~/anaconda3/envs/demo_tts/lib/python3.10/site-packages/vllm/worker/worker_base.py:327, in LocalOrDistributedWorkerBase.execute_model(self, execute_model_req)
    322     if (self.observability_config is not None
    323             and self.observability_config.collect_model_execute_time):
    324         orig_model_execute_time = intermediate_tensors.tensors.get(
    325             "model_execute_time", torch.tensor(0)).item()
--> 327 output = self.model_runner.execute_model(
    328     model_input=model_input,
    329     kv_caches=self.kv_cache[worker_input.virtual_engine]
    330     if self.kv_cache is not None else None,
    331     intermediate_tensors=intermediate_tensors,
    332     num_steps=num_steps,
    333     **kwargs,
    334 )
    336 model_execute_time = time.perf_counter() - start_time
    337 if not get_pp_group().is_last_rank:
    338     # output is IntermediateTensors

File ~/anaconda3/envs/demo_tts/lib/python3.10/site-packages/torch/utils/_contextlib.py:116, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    113 @functools.wraps(func)
    114 def decorate_context(*args, **kwargs):
    115     with ctx_factory():
--> 116         return func(*args, **kwargs)

File ~/anaconda3/envs/demo_tts/lib/python3.10/site-packages/vllm/worker/model_runner.py:1505, in ModelRunner.execute_model(self, model_input, kv_caches, intermediate_tensors, num_steps)
   1501 if self.model_supports_input_embeds:
   1502     model_params.update(
   1503         inputs_embeds=model_input.input_embeds,
   1504         inputs_embeds_masks=model_input.input_embeds_masks)
-> 1505 hidden_or_intermediate_states = model_executable(**model_params)
   1507 if (self.observability_config is not None
   1508         and self.observability_config.collect_model_forward_time):
   1509     model_forward_end.record()

File ~/anaconda3/envs/demo_tts/lib/python3.10/site-packages/torch/nn/modules/module.py:1553, in Module._wrapped_call_impl(self, *args, **kwargs)
   1551     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1552 else:
-> 1553     return self._call_impl(*args, **kwargs)

File ~/anaconda3/envs/demo_tts/lib/python3.10/site-packages/torch/nn/modules/module.py:1562, in Module._call_impl(self, *args, **kwargs)
   1557 # If we don't have any hooks, we want to skip the rest of the logic in
   1558 # this function, and just call forward.
   1559 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1560         or _global_backward_pre_hooks or _global_backward_hooks
   1561         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1562     return forward_call(*args, **kwargs)
   1564 try:
   1565     result = None

File ~/anaconda3/envs/demo_tts/lib/python3.10/site-packages/vllm/model_executor/models/llama.py:433, in LlamaForCausalLM.forward(self, input_ids, positions, kv_caches, attn_metadata, intermediate_tensors, inputs_embeds, inputs_embeds_masks)
    423 def forward(
    424     self,
    425     input_ids: torch.Tensor,
   (...)
    431     inputs_embeds_masks: Optional[torch.Tensor] = None,
    432 ) -> Union[torch.Tensor, IntermediateTensors]:
--> 433     model_output = self.model(input_ids, positions, kv_caches,
    434                               attn_metadata, intermediate_tensors,
    435                               inputs_embeds, inputs_embeds_masks)
    436     return model_output

File ~/anaconda3/envs/demo_tts/lib/python3.10/site-packages/torch/nn/modules/module.py:1553, in Module._wrapped_call_impl(self, *args, **kwargs)
   1551     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1552 else:
-> 1553     return self._call_impl(*args, **kwargs)

File ~/anaconda3/envs/demo_tts/lib/python3.10/site-packages/torch/nn/modules/module.py:1562, in Module._call_impl(self, *args, **kwargs)
   1557 # If we don't have any hooks, we want to skip the rest of the logic in
   1558 # this function, and just call forward.
   1559 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1560         or _global_backward_pre_hooks or _global_backward_hooks
   1561         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1562     return forward_call(*args, **kwargs)
   1564 try:
   1565     result = None

File ~/anaconda3/envs/demo_tts/lib/python3.10/site-packages/vllm/model_executor/models/llama.py:331, in LlamaModel.forward(self, input_ids, positions, kv_caches, attn_metadata, intermediate_tensors, inputs_embeds, inputs_embeds_masks)
    329 for i in range(self.start_layer, self.end_layer):
    330     layer = self.layers[i]
--> 331     hidden_states, residual = layer(
    332         positions,
    333         hidden_states,
    334         kv_caches[i - self.start_layer],
    335         attn_metadata,
    336         residual,
    337     )
    339 if not get_pp_group().is_last_rank:
    340     return IntermediateTensors({
    341         "hidden_states": hidden_states,
    342         "residual": residual
    343     })

File ~/anaconda3/envs/demo_tts/lib/python3.10/site-packages/torch/nn/modules/module.py:1553, in Module._wrapped_call_impl(self, *args, **kwargs)
   1551     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1552 else:
-> 1553     return self._call_impl(*args, **kwargs)

File ~/anaconda3/envs/demo_tts/lib/python3.10/site-packages/torch/nn/modules/module.py:1562, in Module._call_impl(self, *args, **kwargs)
   1557 # If we don't have any hooks, we want to skip the rest of the logic in
   1558 # this function, and just call forward.
   1559 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1560         or _global_backward_pre_hooks or _global_backward_hooks
   1561         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1562     return forward_call(*args, **kwargs)
   1564 try:
   1565     result = None

File ~/anaconda3/envs/demo_tts/lib/python3.10/site-packages/vllm/model_executor/models/llama.py:252, in LlamaDecoderLayer.forward(self, positions, hidden_states, kv_cache, attn_metadata, residual)
    249 else:
    250     hidden_states, residual = self.input_layernorm(
    251         hidden_states, residual)
--> 252 hidden_states = self.self_attn(
    253     positions=positions,
    254     hidden_states=hidden_states,
    255     kv_cache=kv_cache,
    256     attn_metadata=attn_metadata,
    257 )
    259 # Fully Connected
    260 hidden_states, residual = self.post_attention_layernorm(
    261     hidden_states, residual)

File ~/anaconda3/envs/demo_tts/lib/python3.10/site-packages/torch/nn/modules/module.py:1553, in Module._wrapped_call_impl(self, *args, **kwargs)
   1551     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1552 else:
-> 1553     return self._call_impl(*args, **kwargs)

File ~/anaconda3/envs/demo_tts/lib/python3.10/site-packages/torch/nn/modules/module.py:1562, in Module._call_impl(self, *args, **kwargs)
   1557 # If we don't have any hooks, we want to skip the rest of the logic in
   1558 # this function, and just call forward.
   1559 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1560         or _global_backward_pre_hooks or _global_backward_hooks
   1561         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1562     return forward_call(*args, **kwargs)
   1564 try:
   1565     result = None

File ~/anaconda3/envs/demo_tts/lib/python3.10/site-packages/vllm/model_executor/models/llama.py:182, in LlamaAttention.forward(self, positions, hidden_states, kv_cache, attn_metadata)
    180 q, k, v = qkv.split([self.q_size, self.kv_size, self.kv_size], dim=-1)
    181 q, k = self.rotary_emb(positions, q, k)
--> 182 attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
    183 output, _ = self.o_proj(attn_output)
    184 return output

File ~/anaconda3/envs/demo_tts/lib/python3.10/site-packages/torch/nn/modules/module.py:1553, in Module._wrapped_call_impl(self, *args, **kwargs)
   1551     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1552 else:
-> 1553     return self._call_impl(*args, **kwargs)

File ~/anaconda3/envs/demo_tts/lib/python3.10/site-packages/torch/nn/modules/module.py:1562, in Module._call_impl(self, *args, **kwargs)
   1557 # If we don't have any hooks, we want to skip the rest of the logic in
   1558 # this function, and just call forward.
   1559 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1560         or _global_backward_pre_hooks or _global_backward_hooks
   1561         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1562     return forward_call(*args, **kwargs)
   1564 try:
   1565     result = None

File ~/anaconda3/envs/demo_tts/lib/python3.10/site-packages/vllm/attention/layer.py:98, in Attention.forward(self, query, key, value, kv_cache, attn_metadata, attn_type)
     88 def forward(
     89     self,
     90     query: torch.Tensor,
   (...)
     95     attn_type: AttentionType = AttentionType.DECODER,
     96 ) -> torch.Tensor:
---> 98     return self.impl.forward(query,
     99                              key,
    100                              value,
    101                              kv_cache,
    102                              attn_metadata,
    103                              self._k_scale,
    104                              self._v_scale,
    105                              attn_type=attn_type)

File ~/anaconda3/envs/demo_tts/lib/python3.10/site-packages/vllm/attention/backends/xformers.py:574, in XFormersImpl.forward(self, query, key, value, kv_cache, attn_metadata, k_scale, v_scale, attn_type)
    569     num_decode_tokens = 0
    571 if attn_type == AttentionType.DECODER:
    572     # Only enforce this shape-constraint for decoder
    573     # self-attention
--> 574     assert key.shape[0] == num_prefill_tokens + num_decode_tokens
    575     assert value.shape[0] == num_prefill_tokens + num_decode_tokens
    577 output = torch.empty_like(query)

OswaldoBornemann · Sep 10 '24

A few models have been added since this PR was opened. Can you go through the models and check that all of them support this input? A rough sketch of the pattern is shown below.
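
For anyone checking model support, this is an illustration of the pattern implied by the forward signature visible in the traceback above, built from toy modules rather than the actual PR diff:

from typing import Optional

import torch

class ToyDecoderModel(torch.nn.Module):
    """Toy illustration of a model whose forward accepts precomputed embeddings."""

    def __init__(self, vocab_size: int = 32000, hidden_size: int = 4096):
        super().__init__()
        self.embed_tokens = torch.nn.Embedding(vocab_size, hidden_size)

    def forward(
        self,
        input_ids: torch.Tensor,
        inputs_embeds: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        # When precomputed embeddings are provided, skip the embedding lookup;
        # otherwise embed the token IDs as usual. A model that never consults
        # inputs_embeds here cannot support this input.
        if inputs_embeds is not None:
            hidden_states = inputs_embeds
        else:
            hidden_states = self.embed_tokens(input_ids)
        return hidden_states  # downstream decoder layers would consume this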

DarkLight1337 · Sep 12 '24

Here is a bug:

vllm/inputs/preprocess.py (line 333) sets prompt_token_ids=[], but the check in vllm/engine/llm_engine.py (line 1721) is: if prompt_ids is None

Traceback (most recent call last):
  File "/mnt/bn/integrated-risk-model2/LLM_Inference_Service/llmserver_diy/llmserver/core/vanilla_vllm/vanilla_vllm_scheduler.py", line 42, in inner
    async for result in func(*args, **kwargs):
  File "/mnt/bn/integrated-risk-model2/LLM_Inference_Service/llmserver_diy/llmserver/core/vanilla_vllm/vanilla_vllm_scheduler.py", line 160, in generate
    async for result in self.vanilla_vllm_engine.generate(*generate_input):
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 882, in generate
    async for output in await self.add_request(
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 109, in generator
    raise result
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 666, in engine_step
    await self.engine.add_request_async(**new_request)
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 430, in add_request_async
    self._add_processed_request(
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 627, in _add_processed_request
    self._validate_model_inputs(processed_inputs)
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 1730, in _validate_model_inputs
    raise ValueError("You can only provide either tokens or "
ValueError: You can only provide either tokens or embeddings, not both

This change will work: if prompt_ids is None or len(prompt_ids) == 0:
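
To illustrate the point, here is a simplified sketch of the validation logic with the suggested fix applied; the function name and argument layout are assumptions for illustration, not the exact vLLM source:

from typing import Optional, Sequence

import torch

def validate_model_inputs(prompt_ids: Optional[Sequence[int]],
                          prompt_embeds: Optional[torch.Tensor]) -> None:
    """Simplified stand-in for the check in llm_engine.py discussed above."""
    # Suggested change: treat an empty token list ([]) the same as None,
    # since the preprocessor emits prompt_token_ids=[] for embedding inputs.
    has_tokens = prompt_ids is not None and len(prompt_ids) > 0
    has_embeds = prompt_embeds is not None

    if not has_tokens and not has_embeds:
        raise ValueError("Either prompt tokens or prompt embeddings are required.")
    if has_tokens and has_embeds:
        raise ValueError("You can only provide either tokens or embeddings, not both")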

xvolica · Sep 20 '24

And another bug, for which I haven't found the cause yet, when using the InternLM2 model. Input:

prompt_embeds is a torch.Tensor of shape torch.Size([1777, 4096])
SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=200, min_tokens=1, logprobs=20, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None)

Error:

ERROR 09-20 01:48:34 worker_base.py:464] Traceback (most recent call last):
ERROR 09-20 01:48:34 worker_base.py:464]   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
ERROR 09-20 01:48:34 worker_base.py:464]     return func(*args, **kwargs)
ERROR 09-20 01:48:34 worker_base.py:464]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 worker_base.py:464]   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1654, in execute_model
ERROR 09-20 01:48:34 worker_base.py:464]     hidden_or_intermediate_states = model_executable(**model_params)
ERROR 09-20 01:48:34 worker_base.py:464]                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 worker_base.py:464]   File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 09-20 01:48:34 worker_base.py:464]     return self._call_impl(*args, **kwargs)
ERROR 09-20 01:48:34 worker_base.py:464]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 worker_base.py:464]   File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 09-20 01:48:34 worker_base.py:464]     return forward_call(*args, **kwargs)
ERROR 09-20 01:48:34 worker_base.py:464]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 worker_base.py:464]   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/model_executor/models/internlm2.py", line 332, in forward
ERROR 09-20 01:48:34 worker_base.py:464]     hidden_states = self.model(input_ids, positions, kv_caches,
ERROR 09-20 01:48:34 worker_base.py:464]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 worker_base.py:464]   File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 09-20 01:48:34 worker_base.py:464]     return self._call_impl(*args, **kwargs)
ERROR 09-20 01:48:34 worker_base.py:464]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 worker_base.py:464]   File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 09-20 01:48:34 worker_base.py:464]     return forward_call(*args, **kwargs)
ERROR 09-20 01:48:34 worker_base.py:464]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 worker_base.py:464]   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/model_executor/models/internlm2.py", line 284, in forward
ERROR 09-20 01:48:34 worker_base.py:464]     hidden_states, residual = layer(
ERROR 09-20 01:48:34 worker_base.py:464]                               ^^^^^^
ERROR 09-20 01:48:34 worker_base.py:464]   File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 09-20 01:48:34 worker_base.py:464]     return self._call_impl(*args, **kwargs)
ERROR 09-20 01:48:34 worker_base.py:464]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 worker_base.py:464]   File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 09-20 01:48:34 worker_base.py:464]     return forward_call(*args, **kwargs)
ERROR 09-20 01:48:34 worker_base.py:464]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 worker_base.py:464]   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/model_executor/models/internlm2.py", line 220, in forward
ERROR 09-20 01:48:34 worker_base.py:464]     hidden_states = self.attention(
ERROR 09-20 01:48:34 worker_base.py:464]                     ^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 worker_base.py:464]   File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 09-20 01:48:34 worker_base.py:464]     return self._call_impl(*args, **kwargs)
ERROR 09-20 01:48:34 worker_base.py:464]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 worker_base.py:464]   File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 09-20 01:48:34 worker_base.py:464]     return forward_call(*args, **kwargs)
ERROR 09-20 01:48:34 worker_base.py:464]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 worker_base.py:464]   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/model_executor/models/internlm2.py", line 166, in forward
ERROR 09-20 01:48:34 worker_base.py:464]     attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
ERROR 09-20 01:48:34 worker_base.py:464]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 worker_base.py:464]   File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 09-20 01:48:34 worker_base.py:464]     return self._call_impl(*args, **kwargs)
ERROR 09-20 01:48:34 worker_base.py:464]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 worker_base.py:464]   File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 09-20 01:48:34 worker_base.py:464]     return forward_call(*args, **kwargs)
ERROR 09-20 01:48:34 worker_base.py:464]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 worker_base.py:464]   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/attention/layer.py", line 98, in forward
ERROR 09-20 01:48:34 worker_base.py:464]     return self.impl.forward(query,
ERROR 09-20 01:48:34 worker_base.py:464]            ^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 worker_base.py:464]   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/attention/backends/flash_attn.py", line 702, in forward
ERROR 09-20 01:48:34 worker_base.py:464]     assert key.shape[0] == num_prefill_tokens + num_decode_tokens
ERROR 09-20 01:48:34 worker_base.py:464]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 worker_base.py:464] AssertionError
ERROR 09-20 01:48:34 worker_base.py:464] 
ERROR 09-20 01:48:34 worker_base.py:464] During handling of the above exception, another exception occurred:
ERROR 09-20 01:48:34 worker_base.py:464] 
ERROR 09-20 01:48:34 worker_base.py:464] Traceback (most recent call last):
ERROR 09-20 01:48:34 worker_base.py:464]   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 456, in execute_method
ERROR 09-20 01:48:34 worker_base.py:464]     return executor(*args, **kwargs)
ERROR 09-20 01:48:34 worker_base.py:464]            ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 worker_base.py:464]   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 327, in execute_model
ERROR 09-20 01:48:34 worker_base.py:464]     output = self.model_runner.execute_model(
ERROR 09-20 01:48:34 worker_base.py:464]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 worker_base.py:464]   File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 09-20 01:48:34 worker_base.py:464]     return func(*args, **kwargs)
ERROR 09-20 01:48:34 worker_base.py:464]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 worker_base.py:464]   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/model_runner_base.py", line 140, in _wrapper
ERROR 09-20 01:48:34 worker_base.py:464]     pickle.dump(dumped_inputs, filep)
ERROR 09-20 01:48:34 worker_base.py:464] AttributeError: Can't pickle local object 'weak_bind.<locals>.weak_bound'
ERROR 09-20 01:48:34 async_llm_engine.py:61] Engine background task failed
ERROR 09-20 01:48:34 async_llm_engine.py:61] Traceback (most recent call last):
ERROR 09-20 01:48:34 async_llm_engine.py:61]   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
ERROR 09-20 01:48:34 async_llm_engine.py:61]     return func(*args, **kwargs)
ERROR 09-20 01:48:34 async_llm_engine.py:61]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61]   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1654, in execute_model
ERROR 09-20 01:48:34 async_llm_engine.py:61]     hidden_or_intermediate_states = model_executable(**model_params)
ERROR 09-20 01:48:34 async_llm_engine.py:61]                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61]   File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 09-20 01:48:34 async_llm_engine.py:61]     return self._call_impl(*args, **kwargs)
ERROR 09-20 01:48:34 async_llm_engine.py:61]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61]   File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 09-20 01:48:34 async_llm_engine.py:61]     return forward_call(*args, **kwargs)
ERROR 09-20 01:48:34 async_llm_engine.py:61]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61]   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/model_executor/models/internlm2.py", line 332, in forward
ERROR 09-20 01:48:34 async_llm_engine.py:61]     hidden_states = self.model(input_ids, positions, kv_caches,
ERROR 09-20 01:48:34 async_llm_engine.py:61]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61]   File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 09-20 01:48:34 async_llm_engine.py:61]     return self._call_impl(*args, **kwargs)
ERROR 09-20 01:48:34 async_llm_engine.py:61]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61]   File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 09-20 01:48:34 async_llm_engine.py:61]     return forward_call(*args, **kwargs)
ERROR 09-20 01:48:34 async_llm_engine.py:61]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61]   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/model_executor/models/internlm2.py", line 284, in forward
ERROR 09-20 01:48:34 async_llm_engine.py:61]     hidden_states, residual = layer(
ERROR 09-20 01:48:34 async_llm_engine.py:61]                               ^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61]   File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 09-20 01:48:34 async_llm_engine.py:61]     return self._call_impl(*args, **kwargs)
ERROR 09-20 01:48:34 async_llm_engine.py:61]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61]   File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 09-20 01:48:34 async_llm_engine.py:61]     return forward_call(*args, **kwargs)
ERROR 09-20 01:48:34 async_llm_engine.py:61]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61]   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/model_executor/models/internlm2.py", line 220, in forward
ERROR 09-20 01:48:34 async_llm_engine.py:61]     hidden_states = self.attention(
ERROR 09-20 01:48:34 async_llm_engine.py:61]                     ^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61]   File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 09-20 01:48:34 async_llm_engine.py:61]     return self._call_impl(*args, **kwargs)
ERROR 09-20 01:48:34 async_llm_engine.py:61]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61]   File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 09-20 01:48:34 async_llm_engine.py:61]     return forward_call(*args, **kwargs)
ERROR 09-20 01:48:34 async_llm_engine.py:61]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61]   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/model_executor/models/internlm2.py", line 166, in forward
ERROR 09-20 01:48:34 async_llm_engine.py:61]     attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
ERROR 09-20 01:48:34 async_llm_engine.py:61]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61]   File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 09-20 01:48:34 async_llm_engine.py:61]     return self._call_impl(*args, **kwargs)
ERROR 09-20 01:48:34 async_llm_engine.py:61]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61]   File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 09-20 01:48:34 async_llm_engine.py:61]     return forward_call(*args, **kwargs)
ERROR 09-20 01:48:34 async_llm_engine.py:61]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61]   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/attention/layer.py", line 98, in forward
ERROR 09-20 01:48:34 async_llm_engine.py:61]     return self.impl.forward(query,
ERROR 09-20 01:48:34 async_llm_engine.py:61]            ^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61]   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/attention/backends/flash_attn.py", line 702, in forward
ERROR 09-20 01:48:34 async_llm_engine.py:61]     assert key.shape[0] == num_prefill_tokens + num_decode_tokens
ERROR 09-20 01:48:34 async_llm_engine.py:61]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61] AssertionError
ERROR 09-20 01:48:34 async_llm_engine.py:61] 
ERROR 09-20 01:48:34 async_llm_engine.py:61] During handling of the above exception, another exception occurred:
ERROR 09-20 01:48:34 async_llm_engine.py:61] 
ERROR 09-20 01:48:34 async_llm_engine.py:61] Traceback (most recent call last):
ERROR 09-20 01:48:34 async_llm_engine.py:61]   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 51, in _log_task_completion
ERROR 09-20 01:48:34 async_llm_engine.py:61]     return_value = task.result()
ERROR 09-20 01:48:34 async_llm_engine.py:61]                    ^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61]   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 755, in run_engine_loop
ERROR 09-20 01:48:34 async_llm_engine.py:61]     result = task.result()
ERROR 09-20 01:48:34 async_llm_engine.py:61]              ^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61]   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 678, in engine_step
ERROR 09-20 01:48:34 async_llm_engine.py:61]     request_outputs = await self.engine.step_async(virtual_engine)
ERROR 09-20 01:48:34 async_llm_engine.py:61]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61]   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 343, in step_async
ERROR 09-20 01:48:34 async_llm_engine.py:61]     outputs = await self.model_executor.execute_model_async(
ERROR 09-20 01:48:34 async_llm_engine.py:61]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61]   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/executor/ray_gpu_executor.py", line 523, in execute_model_async
ERROR 09-20 01:48:34 async_llm_engine.py:61]     return await super().execute_model_async(execute_model_req)
ERROR 09-20 01:48:34 async_llm_engine.py:61]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61]   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/executor/distributed_gpu_executor.py", line 177, in execute_model_async
ERROR 09-20 01:48:34 async_llm_engine.py:61]     return await self._driver_execute_model_async(execute_model_req)
ERROR 09-20 01:48:34 async_llm_engine.py:61]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61]   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/executor/ray_gpu_executor.py", line 540, in _driver_execute_model_async
ERROR 09-20 01:48:34 async_llm_engine.py:61]     return await self.driver_exec_method("execute_model",
ERROR 09-20 01:48:34 async_llm_engine.py:61]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61]   File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/concurrent/futures/thread.py", line 58, in run
ERROR 09-20 01:48:34 async_llm_engine.py:61]     result = self.fn(*self.args, **self.kwargs)
ERROR 09-20 01:48:34 async_llm_engine.py:61]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61]   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 465, in execute_method
ERROR 09-20 01:48:34 async_llm_engine.py:61]     raise e
ERROR 09-20 01:48:34 async_llm_engine.py:61]   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 456, in execute_method
ERROR 09-20 01:48:34 async_llm_engine.py:61]     return executor(*args, **kwargs)
ERROR 09-20 01:48:34 async_llm_engine.py:61]            ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61]   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 327, in execute_model
ERROR 09-20 01:48:34 async_llm_engine.py:61]     output = self.model_runner.execute_model(
ERROR 09-20 01:48:34 async_llm_engine.py:61]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61]   File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 09-20 01:48:34 async_llm_engine.py:61]     return func(*args, **kwargs)
ERROR 09-20 01:48:34 async_llm_engine.py:61]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61]   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/model_runner_base.py", line 140, in _wrapper
ERROR 09-20 01:48:34 async_llm_engine.py:61]     pickle.dump(dumped_inputs, filep)
ERROR 09-20 01:48:34 async_llm_engine.py:61] AttributeError: Can't pickle local object 'weak_bind.<locals>.weak_bound'
2024-09-20 01:48:34.319 base_events.py:1771 [ERROR]: Exception in callback _log_task_completion(error_callback=<bound method...7fb8dc6d3fd0>>)(<Task finishe...weak_bound'")>) at /home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py:41
handle: <Handle _log_task_completion(error_callback=<bound method...7fb8dc6d3fd0>>)(<Task finishe...weak_bound'")>) at /home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py:41>
Traceback (most recent call last):
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1654, in execute_model
    hidden_or_intermediate_states = model_executable(**model_params)
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/model_executor/models/internlm2.py", line 332, in forward
    hidden_states = self.model(input_ids, positions, kv_caches,
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/model_executor/models/internlm2.py", line 284, in forward
    hidden_states, residual = layer(
                              ^^^^^^
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/model_executor/models/internlm2.py", line 220, in forward
    hidden_states = self.attention(
                    ^^^^^^^^^^^^^^^
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/model_executor/models/internlm2.py", line 166, in forward
    attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/attention/layer.py", line 98, in forward
    return self.impl.forward(query,
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/attention/backends/flash_attn.py", line 702, in forward
    assert key.shape[0] == num_prefill_tokens + num_decode_tokens
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 51, in _log_task_completion
    return_value = task.result()
                   ^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 755, in run_engine_loop
    result = task.result()
             ^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 678, in engine_step
    request_outputs = await self.engine.step_async(virtual_engine)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 343, in step_async
    outputs = await self.model_executor.execute_model_async(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/executor/ray_gpu_executor.py", line 523, in execute_model_async
    return await super().execute_model_async(execute_model_req)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/executor/distributed_gpu_executor.py", line 177, in execute_model_async
    return await self._driver_execute_model_async(execute_model_req)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/executor/ray_gpu_executor.py", line 540, in _driver_execute_model_async
    return await self.driver_exec_method("execute_model",
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 465, in execute_method
    raise e
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 456, in execute_method
    return executor(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 327, in execute_model
    output = self.model_runner.execute_model(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/model_runner_base.py", line 140, in _wrapper
    pickle.dump(dumped_inputs, filep)
AttributeError: Can't pickle local object 'weak_bind.<locals>.weak_bound'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 63, in _log_task_completion
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
Traceback (most recent call last):
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1654, in execute_model
    hidden_or_intermediate_states = model_executable(**model_params)
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/model_executor/models/internlm2.py", line 332, in forward
    hidden_states = self.model(input_ids, positions, kv_caches,
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/model_executor/models/internlm2.py", line 284, in forward
    hidden_states, residual = layer(
                              ^^^^^^
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/model_executor/models/internlm2.py", line 220, in forward
    hidden_states = self.attention(
                    ^^^^^^^^^^^^^^^
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/model_executor/models/internlm2.py", line 166, in forward
    attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/attention/layer.py", line 98, in forward
    return self.impl.forward(query,
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/attention/backends/flash_attn.py", line 702, in forward
    assert key.shape[0] == num_prefill_tokens + num_decode_tokens
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 882, in generate
    async for output in await self.add_request(
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 109, in generator
    raise result
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 51, in _log_task_completion
    return_value = task.result()
                   ^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 755, in run_engine_loop
    result = task.result()
             ^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 678, in engine_step
    request_outputs = await self.engine.step_async(virtual_engine)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 343, in step_async
    outputs = await self.model_executor.execute_model_async(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/executor/ray_gpu_executor.py", line 523, in execute_model_async
    return await super().execute_model_async(execute_model_req)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/executor/distributed_gpu_executor.py", line 177, in execute_model_async
    return await self._driver_execute_model_async(execute_model_req)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/executor/ray_gpu_executor.py", line 540, in _driver_execute_model_async
    return await self.driver_exec_method("execute_model",
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 465, in execute_method
    raise e
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 456, in execute_method
    return executor(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 327, in execute_model
    output = self.model_runner.execute_model(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/model_runner_base.py", line 140, in _wrapper
    pickle.dump(dumped_inputs, filep)
AttributeError: Can't pickle local object 'weak_bind.<locals>.weak_bound'

xvolica · Sep 20 '24

And another bug I do not find reason yet Using InternLM2 model Input:

prompt_embeds is a torch.Tensor, it's torch.Size([1777, 4096]) 
SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=200, min_tokens=1, logprobs=20, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None)

Error:

ERROR 09-20 01:48:34 worker_base.py:464] Traceback (most recent call last):
ERROR 09-20 01:48:34 worker_base.py:464]   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
ERROR 09-20 01:48:34 worker_base.py:464]     return func(*args, **kwargs)
ERROR 09-20 01:48:34 worker_base.py:464]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 worker_base.py:464]   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1654, in execute_model
ERROR 09-20 01:48:34 worker_base.py:464]     hidden_or_intermediate_states = model_executable(**model_params)
ERROR 09-20 01:48:34 worker_base.py:464]                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 worker_base.py:464]   File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 09-20 01:48:34 worker_base.py:464]     return self._call_impl(*args, **kwargs)
ERROR 09-20 01:48:34 worker_base.py:464]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 worker_base.py:464]   File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 09-20 01:48:34 worker_base.py:464]     return forward_call(*args, **kwargs)
ERROR 09-20 01:48:34 worker_base.py:464]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 worker_base.py:464]   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/model_executor/models/internlm2.py", line 332, in forward
ERROR 09-20 01:48:34 worker_base.py:464]     hidden_states = self.model(input_ids, positions, kv_caches,
ERROR 09-20 01:48:34 worker_base.py:464]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 worker_base.py:464]   File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 09-20 01:48:34 worker_base.py:464]     return self._call_impl(*args, **kwargs)
ERROR 09-20 01:48:34 worker_base.py:464]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 worker_base.py:464]   File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 09-20 01:48:34 worker_base.py:464]     return forward_call(*args, **kwargs)
ERROR 09-20 01:48:34 worker_base.py:464]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 worker_base.py:464]   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/model_executor/models/internlm2.py", line 284, in forward
ERROR 09-20 01:48:34 worker_base.py:464]     hidden_states, residual = layer(
ERROR 09-20 01:48:34 worker_base.py:464]                               ^^^^^^
ERROR 09-20 01:48:34 worker_base.py:464]   File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 09-20 01:48:34 worker_base.py:464]     return self._call_impl(*args, **kwargs)
ERROR 09-20 01:48:34 worker_base.py:464]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 worker_base.py:464]   File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 09-20 01:48:34 worker_base.py:464]     return forward_call(*args, **kwargs)
ERROR 09-20 01:48:34 worker_base.py:464]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 worker_base.py:464]   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/model_executor/models/internlm2.py", line 220, in forward
ERROR 09-20 01:48:34 worker_base.py:464]     hidden_states = self.attention(
ERROR 09-20 01:48:34 worker_base.py:464]                     ^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 worker_base.py:464]   File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 09-20 01:48:34 worker_base.py:464]     return self._call_impl(*args, **kwargs)
ERROR 09-20 01:48:34 worker_base.py:464]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 worker_base.py:464]   File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 09-20 01:48:34 worker_base.py:464]     return forward_call(*args, **kwargs)
ERROR 09-20 01:48:34 worker_base.py:464]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 worker_base.py:464]   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/model_executor/models/internlm2.py", line 166, in forward
ERROR 09-20 01:48:34 worker_base.py:464]     attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
ERROR 09-20 01:48:34 worker_base.py:464]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 worker_base.py:464]   File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 09-20 01:48:34 worker_base.py:464]     return self._call_impl(*args, **kwargs)
ERROR 09-20 01:48:34 worker_base.py:464]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 worker_base.py:464]   File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 09-20 01:48:34 worker_base.py:464]     return forward_call(*args, **kwargs)
ERROR 09-20 01:48:34 worker_base.py:464]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 worker_base.py:464]   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/attention/layer.py", line 98, in forward
ERROR 09-20 01:48:34 worker_base.py:464]     return self.impl.forward(query,
ERROR 09-20 01:48:34 worker_base.py:464]            ^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 worker_base.py:464]   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/attention/backends/flash_attn.py", line 702, in forward
ERROR 09-20 01:48:34 worker_base.py:464]     assert key.shape[0] == num_prefill_tokens + num_decode_tokens
ERROR 09-20 01:48:34 worker_base.py:464]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 worker_base.py:464] AssertionError
ERROR 09-20 01:48:34 worker_base.py:464] 
ERROR 09-20 01:48:34 worker_base.py:464] During handling of the above exception, another exception occurred:
ERROR 09-20 01:48:34 worker_base.py:464] 
ERROR 09-20 01:48:34 worker_base.py:464] Traceback (most recent call last):
ERROR 09-20 01:48:34 worker_base.py:464]   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 456, in execute_method
ERROR 09-20 01:48:34 worker_base.py:464]     return executor(*args, **kwargs)
ERROR 09-20 01:48:34 worker_base.py:464]            ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 worker_base.py:464]   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 327, in execute_model
ERROR 09-20 01:48:34 worker_base.py:464]     output = self.model_runner.execute_model(
ERROR 09-20 01:48:34 worker_base.py:464]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 worker_base.py:464]   File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 09-20 01:48:34 worker_base.py:464]     return func(*args, **kwargs)
ERROR 09-20 01:48:34 worker_base.py:464]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 worker_base.py:464]   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/model_runner_base.py", line 140, in _wrapper
ERROR 09-20 01:48:34 worker_base.py:464]     pickle.dump(dumped_inputs, filep)
ERROR 09-20 01:48:34 worker_base.py:464] AttributeError: Can't pickle local object 'weak_bind.<locals>.weak_bound'
ERROR 09-20 01:48:34 async_llm_engine.py:61] Engine background task failed
ERROR 09-20 01:48:34 async_llm_engine.py:61] Traceback (most recent call last):
ERROR 09-20 01:48:34 async_llm_engine.py:61]   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
ERROR 09-20 01:48:34 async_llm_engine.py:61]     return func(*args, **kwargs)
ERROR 09-20 01:48:34 async_llm_engine.py:61]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61]   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1654, in execute_model
ERROR 09-20 01:48:34 async_llm_engine.py:61]     hidden_or_intermediate_states = model_executable(**model_params)
ERROR 09-20 01:48:34 async_llm_engine.py:61]                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61]   File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 09-20 01:48:34 async_llm_engine.py:61]     return self._call_impl(*args, **kwargs)
ERROR 09-20 01:48:34 async_llm_engine.py:61]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61]   File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 09-20 01:48:34 async_llm_engine.py:61]     return forward_call(*args, **kwargs)
ERROR 09-20 01:48:34 async_llm_engine.py:61]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61]   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/model_executor/models/internlm2.py", line 332, in forward
ERROR 09-20 01:48:34 async_llm_engine.py:61]     hidden_states = self.model(input_ids, positions, kv_caches,
ERROR 09-20 01:48:34 async_llm_engine.py:61]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61]   File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 09-20 01:48:34 async_llm_engine.py:61]     return self._call_impl(*args, **kwargs)
ERROR 09-20 01:48:34 async_llm_engine.py:61]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61]   File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 09-20 01:48:34 async_llm_engine.py:61]     return forward_call(*args, **kwargs)
ERROR 09-20 01:48:34 async_llm_engine.py:61]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61]   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/model_executor/models/internlm2.py", line 284, in forward
ERROR 09-20 01:48:34 async_llm_engine.py:61]     hidden_states, residual = layer(
ERROR 09-20 01:48:34 async_llm_engine.py:61]                               ^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61]   File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 09-20 01:48:34 async_llm_engine.py:61]     return self._call_impl(*args, **kwargs)
ERROR 09-20 01:48:34 async_llm_engine.py:61]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61]   File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 09-20 01:48:34 async_llm_engine.py:61]     return forward_call(*args, **kwargs)
ERROR 09-20 01:48:34 async_llm_engine.py:61]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61]   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/model_executor/models/internlm2.py", line 220, in forward
ERROR 09-20 01:48:34 async_llm_engine.py:61]     hidden_states = self.attention(
ERROR 09-20 01:48:34 async_llm_engine.py:61]                     ^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61]   File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 09-20 01:48:34 async_llm_engine.py:61]     return self._call_impl(*args, **kwargs)
ERROR 09-20 01:48:34 async_llm_engine.py:61]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61]   File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 09-20 01:48:34 async_llm_engine.py:61]     return forward_call(*args, **kwargs)
ERROR 09-20 01:48:34 async_llm_engine.py:61]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61]   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/model_executor/models/internlm2.py", line 166, in forward
ERROR 09-20 01:48:34 async_llm_engine.py:61]     attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
ERROR 09-20 01:48:34 async_llm_engine.py:61]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61]   File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 09-20 01:48:34 async_llm_engine.py:61]     return self._call_impl(*args, **kwargs)
ERROR 09-20 01:48:34 async_llm_engine.py:61]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61]   File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 09-20 01:48:34 async_llm_engine.py:61]     return forward_call(*args, **kwargs)
ERROR 09-20 01:48:34 async_llm_engine.py:61]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61]   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/attention/layer.py", line 98, in forward
ERROR 09-20 01:48:34 async_llm_engine.py:61]     return self.impl.forward(query,
ERROR 09-20 01:48:34 async_llm_engine.py:61]            ^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61]   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/attention/backends/flash_attn.py", line 702, in forward
ERROR 09-20 01:48:34 async_llm_engine.py:61]     assert key.shape[0] == num_prefill_tokens + num_decode_tokens
ERROR 09-20 01:48:34 async_llm_engine.py:61]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61] AssertionError
ERROR 09-20 01:48:34 async_llm_engine.py:61] 
ERROR 09-20 01:48:34 async_llm_engine.py:61] During handling of the above exception, another exception occurred:
ERROR 09-20 01:48:34 async_llm_engine.py:61] 
ERROR 09-20 01:48:34 async_llm_engine.py:61] Traceback (most recent call last):
ERROR 09-20 01:48:34 async_llm_engine.py:61]   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 51, in _log_task_completion
ERROR 09-20 01:48:34 async_llm_engine.py:61]     return_value = task.result()
ERROR 09-20 01:48:34 async_llm_engine.py:61]                    ^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61]   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 755, in run_engine_loop
ERROR 09-20 01:48:34 async_llm_engine.py:61]     result = task.result()
ERROR 09-20 01:48:34 async_llm_engine.py:61]              ^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61]   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 678, in engine_step
ERROR 09-20 01:48:34 async_llm_engine.py:61]     request_outputs = await self.engine.step_async(virtual_engine)
ERROR 09-20 01:48:34 async_llm_engine.py:61]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61]   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 343, in step_async
ERROR 09-20 01:48:34 async_llm_engine.py:61]     outputs = await self.model_executor.execute_model_async(
ERROR 09-20 01:48:34 async_llm_engine.py:61]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61]   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/executor/ray_gpu_executor.py", line 523, in execute_model_async
ERROR 09-20 01:48:34 async_llm_engine.py:61]     return await super().execute_model_async(execute_model_req)
ERROR 09-20 01:48:34 async_llm_engine.py:61]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61]   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/executor/distributed_gpu_executor.py", line 177, in execute_model_async
ERROR 09-20 01:48:34 async_llm_engine.py:61]     return await self._driver_execute_model_async(execute_model_req)
ERROR 09-20 01:48:34 async_llm_engine.py:61]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61]   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/executor/ray_gpu_executor.py", line 540, in _driver_execute_model_async
ERROR 09-20 01:48:34 async_llm_engine.py:61]     return await self.driver_exec_method("execute_model",
ERROR 09-20 01:48:34 async_llm_engine.py:61]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61]   File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/concurrent/futures/thread.py", line 58, in run
ERROR 09-20 01:48:34 async_llm_engine.py:61]     result = self.fn(*self.args, **self.kwargs)
ERROR 09-20 01:48:34 async_llm_engine.py:61]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61]   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 465, in execute_method
ERROR 09-20 01:48:34 async_llm_engine.py:61]     raise e
ERROR 09-20 01:48:34 async_llm_engine.py:61]   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 456, in execute_method
ERROR 09-20 01:48:34 async_llm_engine.py:61]     return executor(*args, **kwargs)
ERROR 09-20 01:48:34 async_llm_engine.py:61]            ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61]   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 327, in execute_model
ERROR 09-20 01:48:34 async_llm_engine.py:61]     output = self.model_runner.execute_model(
ERROR 09-20 01:48:34 async_llm_engine.py:61]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61]   File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 09-20 01:48:34 async_llm_engine.py:61]     return func(*args, **kwargs)
ERROR 09-20 01:48:34 async_llm_engine.py:61]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 09-20 01:48:34 async_llm_engine.py:61]   File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/model_runner_base.py", line 140, in _wrapper
ERROR 09-20 01:48:34 async_llm_engine.py:61]     pickle.dump(dumped_inputs, filep)
ERROR 09-20 01:48:34 async_llm_engine.py:61] AttributeError: Can't pickle local object 'weak_bind.<locals>.weak_bound'
2024-09-20 01:48:34.319 base_events.py:1771 [ERROR]: Exception in callback _log_task_completion(error_callback=<bound method...7fb8dc6d3fd0>>)(<Task finishe...weak_bound'")>) at /home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py:41
handle: <Handle _log_task_completion(error_callback=<bound method...7fb8dc6d3fd0>>)(<Task finishe...weak_bound'")>) at /home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py:41>
Traceback (most recent call last):
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1654, in execute_model
    hidden_or_intermediate_states = model_executable(**model_params)
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/model_executor/models/internlm2.py", line 332, in forward
    hidden_states = self.model(input_ids, positions, kv_caches,
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/model_executor/models/internlm2.py", line 284, in forward
    hidden_states, residual = layer(
                              ^^^^^^
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/model_executor/models/internlm2.py", line 220, in forward
    hidden_states = self.attention(
                    ^^^^^^^^^^^^^^^
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/model_executor/models/internlm2.py", line 166, in forward
    attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/attention/layer.py", line 98, in forward
    return self.impl.forward(query,
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/attention/backends/flash_attn.py", line 702, in forward
    assert key.shape[0] == num_prefill_tokens + num_decode_tokens
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 51, in _log_task_completion
    return_value = task.result()
                   ^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 755, in run_engine_loop
    result = task.result()
             ^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 678, in engine_step
    request_outputs = await self.engine.step_async(virtual_engine)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 343, in step_async
    outputs = await self.model_executor.execute_model_async(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/executor/ray_gpu_executor.py", line 523, in execute_model_async
    return await super().execute_model_async(execute_model_req)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/executor/distributed_gpu_executor.py", line 177, in execute_model_async
    return await self._driver_execute_model_async(execute_model_req)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/executor/ray_gpu_executor.py", line 540, in _driver_execute_model_async
    return await self.driver_exec_method("execute_model",
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 465, in execute_method
    raise e
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 456, in execute_method
    return executor(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 327, in execute_model
    output = self.model_runner.execute_model(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/model_runner_base.py", line 140, in _wrapper
    pickle.dump(dumped_inputs, filep)
AttributeError: Can't pickle local object 'weak_bind.<locals>.weak_bound'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 63, in _log_task_completion
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
Traceback (most recent call last):
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1654, in execute_model
    hidden_or_intermediate_states = model_executable(**model_params)
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/model_executor/models/internlm2.py", line 332, in forward
    hidden_states = self.model(input_ids, positions, kv_caches,
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/model_executor/models/internlm2.py", line 284, in forward
    hidden_states, residual = layer(
                              ^^^^^^
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/model_executor/models/internlm2.py", line 220, in forward
    hidden_states = self.attention(
                    ^^^^^^^^^^^^^^^
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/model_executor/models/internlm2.py", line 166, in forward
    attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/attention/layer.py", line 98, in forward
    return self.impl.forward(query,
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/attention/backends/flash_attn.py", line 702, in forward
    assert key.shape[0] == num_prefill_tokens + num_decode_tokens
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 882, in generate
    async for output in await self.add_request(
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 109, in generator
    raise result
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 51, in _log_task_completion
    return_value = task.result()
                   ^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 755, in run_engine_loop
    result = task.result()
             ^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 678, in engine_step
    request_outputs = await self.engine.step_async(virtual_engine)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 343, in step_async
    outputs = await self.model_executor.execute_model_async(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/executor/ray_gpu_executor.py", line 523, in execute_model_async
    return await super().execute_model_async(execute_model_req)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/executor/distributed_gpu_executor.py", line 177, in execute_model_async
    return await self._driver_execute_model_async(execute_model_req)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/executor/ray_gpu_executor.py", line 540, in _driver_execute_model_async
    return await self.driver_exec_method("execute_model",
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 465, in execute_method
    raise e
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 456, in execute_method
    return executor(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 327, in execute_model
    output = self.model_runner.execute_model(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.local/lib/python3.11/site-packages/vllm/worker/model_runner_base.py", line 140, in _wrapper
    pickle.dump(dumped_inputs, filep)
AttributeError: Can't pickle local object 'weak_bind.<locals>.weak_bound'

Are you using speculative decoding? It's not supported with input embeds yet.

DarkLight1337 avatar Sep 20 '24 02:09 DarkLight1337

Are you using speculative decoding? It's not supported with input embeds yet.

No, here is my config

AsyncEngineArgs(model='./internlm2-chat-7b', served_model_name=None, tokenizer='./internlm2-chat-7b', skip_tokenizer_init=False, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', config_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, seed=0, max_model_len=None, worker_use_ray=True, distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, revision=None, code_revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, quantization=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, enable_lora=False, max_loras=1, max_lora_rank=16, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, fully_sharded_loras=False, lora_extra_vocab_size=256, long_lora_scaling_factors=None, lora_dtype='auto', max_cpu_loras=None, device='auto', num_scheduler_steps=1, ray_workers_use_nsight=False, num_gpu_blocks_override=None, num_lookahead_slots=0, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, scheduler_delay_factor=0.0, enable_chunked_prefill=None, guided_decoding_backend='outlines', speculative_model=None, speculative_model_quantization=None, speculative_draft_tensor_parallel_size=None, num_speculative_tokens=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, qlora_adapter_name_or_path=None, disable_logprobs_during_spec_decoding=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, disable_log_requests=True)

xvolica avatar Sep 20 '24 02:09 xvolica

I cloned this using gh pr checkout 6869 on the latest vllm and it looks like there's a bug: input processing is broken. When I add

print(f'inputs {inputs}\n preprocessed_inputs {preprocessed_inputs} \n processed_inputs {processed_inputs}')

right before the self._add_processed_request call (vllm/engine/llm_engine.py:749), I get this output (tensor contents truncated for readability):

inputs {'prompt_embeds': tensor([[ 0.0024, 0.0022]],
       grad_fn=<EmbeddingBackward0>)}
preprocessed_inputs {'prompt_token_ids': [], 'prompt': None, 'prompt_embeds': tensor([[ 0.0024, 0.0022]],
       grad_fn=<EmbeddingBackward0>), 'multi_modal_data': None} 
processed_inputs {'prompt_token_ids': [], 'prompt': None, 'prompt_embeds': tensor([[ 0.0024,0.0022]],
       grad_fn=<EmbeddingBackward0>), 'multi_modal_data': None}

As you can see, prompt_token_ids is no longer None, so it fails in LLMEngine._validate_model_inputs with ValueError: You can only provide either tokens or embeddings, not both. I tried to fix that, but since I'm not sure how this all works I didn't succeed. Ignoring input validation and setting processed_inputs['prompt_token_ids'] = [] causes a RuntimeError: CUDA error: device-side assert triggered, and setting it to None causes an error at vllm/sequence.py:418:

SequenceData(
    array(VLLM_TOKEN_ID_ARRAY_TYPE, self.prompt_token_ids),
    self.prompt_embeds)

TypeError: 'NoneType' object is not iterable

I tried two different models; both use LlamaForCausalLM and work fine in vanilla vLLM. I tried running without any sampling parameters, as well as without quantization, and never managed to get it working.
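
A minimal sketch, assuming the processed_inputs dict shape printed above, of the rule the validation presumably intends (illustrative only, not the actual LLMEngine._validate_model_inputs code):

def validate_inputs(processed_inputs: dict) -> None:
    # Hypothetical guard: an empty prompt_token_ids list counts as "no tokens"
    # when prompt_embeds is present, so embeddings-only requests are not
    # rejected as providing both.
    has_tokens = bool(processed_inputs.get("prompt_token_ids"))
    has_embeds = processed_inputs.get("prompt_embeds") is not None
    if has_tokens and has_embeds:
        raise ValueError(
            "You can only provide either tokens or embeddings, not both")
    if not has_tokens and not has_embeds:
        raise ValueError("Either prompt tokens or embeddings must be provided")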

Ouna-the-Dataweaver avatar Sep 20 '24 09:09 Ouna-the-Dataweaver

Yeah, it's bugged now. Going to resume work on this later today. Sorry for breaking this!

I found some deeper issues with how the inputs are being processed right now and am working on a refactor.

DarkLight1337 avatar Sep 20 '24 09:09 DarkLight1337

I have finally fixed the input processor to work with embedding inputs. To reduce the scope of this PR, I'll split out some of the changes (in particular the renaming) into other PRs.

DarkLight1337 avatar Sep 20 '24 17:09 DarkLight1337

thank you @DarkLight1337! Is there anything else you would like me to do for this PR?

Nan2018 avatar Sep 20 '24 19:09 Nan2018

Let's wait for my other PRs to be merged first.

DarkLight1337 avatar Sep 21 '24 01:09 DarkLight1337

I know that this is still in the works, but I tried it before the recent merges and after, and both times I got errors with more or less the same content, so I wanted to report an issue with this PR (or please say if I'm doing something really wrong).

What I was doing:

import gc
import copy
from pathlib import Path

import vllm.vllm as vllm
from vllm.vllm import LLM, SamplingParams
from transformers import AutoTokenizer, LlamaForCausalLM

vicuna_path = Path("...")
llm = LLM(model=vicuna_path, max_model_len=1500)
tokenizer = AutoTokenizer.from_pretrained(vicuna_path)
# embeddings generation
model = LlamaForCausalLM.from_pretrained(vicuna_path)
embeddings = copy.deepcopy(model.model.embed_tokens)
del model
gc.collect()
message = "USER: Count from 1 to 10 please \nASSISTANT:"
input_ids = tokenizer(message, return_tensors="pt")
embed_tokens = embeddings(input_ids['input_ids'])
embed_tokens.shape  # torch.Size([1, 19, 4096])
# skipped sampling params here for simplicity
outputs = llm.generate({'prompt_embeds': embed_tokens})

I get RuntimeError: CUDA error: an illegal memory access was encountered, which is probably caused by

File /media/data/agafonov/repos/allm_service/vllm/vllm/attention/backends/flash_attn.py:682, in FlashAttentionImpl.forward(self, query, key, value, kv_cache, attn_metadata, k_scale, v_scale, attn_type)
    679 assert k_scale == 1.0 and v_scale == 1.0, (
    680     "key/v_scale is not supported in FlashAttention.")
--> 682 num_tokens, hidden_size = query.shape
    683 # Reshape the query, key, and value tensors.

ValueError: too many values to unpack (expected 2)

(end of stack trace) After the error I can't use that GPU until I reload my Jupyter kernel. vllm: latest main version (0.6.2) with gh pr checkout 6869; hardware: 4060 Ti, Linux, CUDA 12.1. Regular vLLM usage after checking out this PR works fine as far as I can see.

Ouna-the-Dataweaver avatar Sep 26 '24 13:09 Ouna-the-Dataweaver

Oh, I found the mistake I made. Basically, this PR expects embeds as a tensor without a batch dimension, but transformers LLMs use batched input.

print(f'embeddings shape: {embeds.shape}')
output = self.llama_model.generate(
            inputs_embeds=embeds,...
)
# embeddings shape: torch.Size([1, 112, 4096])

So transformers works with (batch_len, tokens, emb_dim) tensors, while your code expects each tensor to have shape (tokens, emb_dim). I added torch.squeeze(embed_tokens) and it worked with vllm.

Not sure if that needs to be fixed or not, but for people like me, when you merge this please add a note in the docs that no batch dimension is expected (:
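
For anyone else hitting this, a minimal sketch of converting HF-style batched embeddings into the per-prompt shape this PR expects (the tensor here is just a stand-in for the output of an embedding layer):

import torch

# HF embedding layers return (batch, seq_len, hidden_size); this PR expects
# one (seq_len, hidden_size) tensor per prompt.
embed_tokens = torch.randn(1, 19, 4096)          # stand-in for embeddings(input_ids)
single_prompt = embed_tokens.squeeze(0)          # (19, 4096)
many_prompts = list(embed_tokens.unbind(dim=0))  # one (19, 4096) tensor per batch item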

Ouna-the-Dataweaver avatar Sep 27 '24 04:09 Ouna-the-Dataweaver

Oh, I found the mistake I made. Basically, this PR expects embeds as a tensor without a batch dimension, but transformers LLMs use batched input.

print(f'embeddings shape: {embeds.shape}')
output = self.llama_model.generate(
            inputs_embeds=embeds,...
)
# embeddings shape: torch.Size([1, 112, 4096])

So transformers works with (batch_len, tokens, emb_dim) tensors, while your code expects each tensor to have shape (tokens, emb_dim). I added torch.squeeze(embed_tokens) and it worked with vllm.

Not sure if that needs to be fixed or not, but for people like me, when you merge this please add a note in the docs that no batch dimension is expected (:

Sorry for the late response. I'll certainly add a check for that!

Update: Done!
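
For reference, a minimal sketch of such a shape check, assuming prompt_embeds must be a 2-D (seq_len, hidden_size) tensor; this is illustrative only, not necessarily the exact check added to the PR:

import torch

def check_prompt_embeds_shape(prompt_embeds: torch.Tensor) -> None:
    # Reject batched (batch, seq_len, hidden_size) inputs with a clear message
    # instead of failing later inside the attention backend.
    if prompt_embeds.dim() != 2:
        raise ValueError(
            "prompt_embeds must have shape (seq_len, hidden_size); "
            f"got {tuple(prompt_embeds.shape)}")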

DarkLight1337 avatar Sep 28 '24 06:09 DarkLight1337

@DarkLight1337 Is there any update on this PR? I'm very excited to use this feature.

qthequartermasterman avatar Oct 04 '24 16:10 qthequartermasterman

Thanks for your interest. I'm still waiting for #8688 and a subsequent PR that refactors input processing to be merged first.

DarkLight1337 avatar Oct 05 '24 02:10 DarkLight1337

I tried the newest version of this PR using 'gh pr checkout 6869', but it seems to fail:

[rank0]: Traceback (most recent call last):
[rank0]:   File "internlm2/embed_vllm.py.py", line 45, in <module>
[rank0]:   File "vllm-master/vllm/vllm/utils.py", line 1051, in inner
[rank0]:     return fn(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^
[rank0]:   File "vllm-master/vllm/vllm/entrypoints/llm.py", line 391, in generate
[rank0]:     outputs = self._run_engine(use_tqdm=use_tqdm)
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "vllm-master/vllm/vllm/entrypoints/llm.py", line 899, in _run_engine
[rank0]:     step_outputs = self.llm_engine.step()
[rank0]:                    ^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "vllm-master/vllm/vllm/engine/llm_engine.py", line 1356, in step
[rank0]:     ) = self.scheduler[virtual_engine].schedule()
[rank0]:         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "vllm-master/vllm/vllm/core/scheduler.py", line 1218, in schedule
[rank0]:     scheduler_outputs: SchedulerOutputs = self._schedule()
[rank0]:                                           ^^^^^^^^^^^^^^^^
[rank0]:   File "vllm-master/vllm/vllm/core/scheduler.py", line 1178, in _schedule
[rank0]:     return self._schedule_default()
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "vllm-master/vllm/vllm/core/scheduler.py", line 1013, in _schedule_default
[rank0]:     prefills = self._schedule_prefills(budget,
[rank0]:                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "vllm-master/vllm/vllm/core/scheduler.py", line 949, in _schedule_prefills
[rank0]:     self._allocate_and_set_running(seq_group)
[rank0]:   File "vllm-master/vllm/vllm/core/scheduler.py", line 1411, in _allocate_and_set_running
[rank0]:     self.block_manager.allocate(seq_group)
[rank0]:   File "vllm-master/vllm/vllm/core/block_manager_v2.py", line 169, in allocate
[rank0]:     block_table: BlockTable = self._allocate_sequence(seq)
[rank0]:                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "vllm-master/vllm/vllm/core/block_manager_v2.py", line 155, in _allocate_sequence
[rank0]:     block_table.allocate(seq.get_token_ids())
[rank0]:   File "vllm-master/vllm/vllm/core/block/block_table.py", line 95, in allocate
[rank0]:     assert token_ids
[rank0]: AssertionError

xvolica avatar Oct 08 '24 23:10 xvolica

I tried the newest version of this PR using 'gh pr checkout 6869', but it seems to fail:

We will likely need another pass to get this PR to work with blockmanagerv2. I'm still waiting for previous PRs to be merged.

DarkLight1337 avatar Oct 09 '24 02:10 DarkLight1337

@qthequartermasterman Sorry for the delay, ~~I think I have fixed the breaking issues now~~. #9604 is the last PR to be merged before we can move forward with this PR.

I'm still unable to pass the assertion of num_new_tokens > 0. For input embeddings, there are no token inputs to the model, so we may need to update the existing code to accommodate this. However, this will likely interfere with the current re-arch efforts. @WoosukKwon @comaniac any suggestions?
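
One possible direction, sketched below under the assumption that the scheduler only needs a sequence length rather than real token IDs; this is an illustration, not what the PR currently does:

import torch

def placeholder_token_ids(prompt_embeds: torch.Tensor, pad_id: int = 0) -> list:
    # prompt_embeds is assumed to have shape (seq_len, hidden_size); emit one
    # dummy token ID per embedded position so block allocation and the
    # num_new_tokens accounting still see a non-empty prompt.
    return [pad_id] * prompt_embeds.shape[0]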

DarkLight1337 avatar Oct 23 '24 08:10 DarkLight1337

@DarkLight1337 @Nan2018 thanks guys, this is an essential PR. Can you make it supported by the OpenAI-compatible endpoint? That way it would be convenient to call it via the RESTful API.

toilaluan avatar Oct 30 '24 12:10 toilaluan

@DarkLight1337 @Nan2018 thanks guys, this is an essential PR. Can you make it supported by the OpenAI-compatible endpoint? That way it would be convenient to call it via the RESTful API.

Is this an existing endpoint from OpenAI? What would the schema look like in this case?

DarkLight1337 avatar Oct 30 '24 13:10 DarkLight1337

@DarkLight1337 @Nan2018 thanks guys, this is an essential PR. Can you make it supported by the OpenAI-compatible endpoint? That way it would be convenient to call it via the RESTful API.

Is this an existing endpoint from OpenAI? What would the schema look like in this case?

Yes, I mean this API: https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/openai/api_server.py

And the payload can be like this:

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama-3.1",
    "prompt_embeds": [[0.1,0.2,0.14,...]], # (Length, Dim) shape instead of prompt
    "max_tokens": 7,
    "temperature": 0
  }'

toilaluan avatar Oct 30 '24 13:10 toilaluan

@DarkLight1337 @Nan2018 thanks guys, this is an essential PR. Can you make it supported by the OpenAI-compatible endpoint? That way it would be convenient to call it via the RESTful API.

Is this an existing endpoint from OpenAI? What would the schema look like in this case?

Yes, I mean this API: https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/openai/api_server.py

And the payload can be like this:

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama-3.1",
    "prompt_embeds": [[0.1,0.2,0.14,...]], # (Length, Dim) shape instead of prompt
    "max_tokens": 7,
    "temperature": 0
  }'

I see, thanks for the example. We'll consider this after this PR is merged.

DarkLight1337 avatar Oct 30 '24 13:10 DarkLight1337

@toilaluan I have a branch for sending prompt embeddings to the OpenAI completions endpoint. Instead of sending the values as text, I used base64 encoding.

import io
from base64 import b64encode

import requests
import torch

prompt_embeds = []
for input_embeds in inputs_embeds:  # inputs_embeds is a list of embeddings of shape (seq_len, hidden_size); seq_len can differ per prompt
    buff = io.BytesIO()
    torch.save(input_embeds.detach().cpu(), buff)
    prompt_embeds.append(b64encode(buff.getvalue()).decode("utf-8"))

response = requests.post(
    url,
    json={
        "model": "llama31",
        "prompt_embeds": prompt_embeds,
    },
).json()
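
For completeness, a rough sketch of how the receiving side could turn such a payload back into tensors; this is an assumption about one possible server-side implementation, not code from this PR:

import io
from base64 import b64decode

import torch

def decode_prompt_embeds(encoded: str) -> torch.Tensor:
    # Reverse of the encoding above: base64 -> bytes -> torch.load, giving a
    # (seq_len, hidden_size) tensor for a single prompt.
    buff = io.BytesIO(b64decode(encoded))
    return torch.load(buff, map_location="cpu")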

@DarkLight1337 if we make it part of the OpenAI API, is a float list like [[0.1,0.2,0.14,...]] preferred over base64 encoding?

Nan2018 avatar Oct 30 '24 17:10 Nan2018

@toilaluan I have a branch for sending prompt embeddings to the OpenAI completions endpoint. Instead of sending the values as text, I used base64 encoding.

import io
from base64 import b64encode

import requests
import torch

prompt_embeds = []
for input_embeds in inputs_embeds:  # inputs_embeds is a list of embeddings of shape (seq_len, hidden_size); seq_len can differ per prompt
    buff = io.BytesIO()
    torch.save(input_embeds.detach().cpu(), buff)
    prompt_embeds.append(b64encode(buff.getvalue()).decode("utf-8"))

response = requests.post(
    url,
    json={
        "model": "llama31",
        "prompt_embeds": prompt_embeds,
    },
).json()

@DarkLight1337 if we make it part of the OpenAI API, is a float list like [[0.1,0.2,0.14,...]] preferred over base64 encoding?

OpenAI's Embeddings endpoint supports returning both floats and base64, so I think it's reasonable to expect that users can pass the embeddings in both forms.

DarkLight1337 avatar Oct 31 '24 01:10 DarkLight1337