MiniCPM-V
vLLM: inference with AsyncLLMEngine gives incorrect results
### Is there an existing issue / discussion for this?
- [X] I have searched the existing issues / discussions
### Is there an existing answer for this in FAQ?
- [X] I have searched FAQ
### Current Behavior
Inference with the MiniCPM-Llama3-V-2_5 model works correctly using the following script:
```python
from PIL import Image
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

MODEL_NAME = "/model"
# Also available for previous models
# MODEL_NAME = "openbmb/MiniCPM-Llama3-V-2_5"
# MODEL_NAME = "HwwwH/MiniCPM-V-2"

if __name__ == "__main__":
    image = Image.open("/app/fruit_stand.jpg").convert("RGB")
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
    llm = LLM(
        model=MODEL_NAME,
        trust_remote_code=True,
        gpu_memory_utilization=1,
        max_model_len=2048,
        tensor_parallel_size=2,
    )

    messages = [{
        "role": "user",
        # The number of `(<image>./</image>)` placeholders must match the number of images
        "content": "(<image>./</image>)" + "\n这是一张什么图片?",  # "What is this image?"
    }]
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )

    # Single inference
    inputs = {
        "prompt": prompt,
        "multi_modal_data": {
            "image": image
            # Multiple images: the number of images must equal the number
            # of `(<image>./</image>)` placeholders in the prompt
            # "image": [image, image]
        },
    }
    # Batch inference
    # inputs = [{
    #     "prompt": prompt,
    #     "multi_modal_data": {
    #         "image": image
    #     },
    # } for _ in range(2)]

    # 2.6
    # stop_tokens = ['<|im_end|>', '<|endoftext|>']
    # stop_token_ids = [tokenizer.convert_tokens_to_ids(i) for i in stop_tokens]
    # 2.0
    # stop_token_ids = [tokenizer.eos_id]
    # 2.5
    stop_token_ids = [tokenizer.eos_id, tokenizer.eot_id]

    sampling_params = SamplingParams(
        stop_token_ids=stop_token_ids,
        use_beam_search=True,
        temperature=0,
        best_of=3,
        max_tokens=1024,
    )

    outputs = llm.generate(inputs, sampling_params=sampling_params)
    print(outputs[0].outputs[0].text)
```
However, when AsyncLLMEngine is used instead of LLM, the inference results are incorrect.
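For reference, a minimal sketch of the async path that should be equivalent (an assumption on my part, not the exact failing code; it uses vLLM 0.5.4's `AsyncEngineArgs` / `AsyncLLMEngine.from_engine_args` API, and reuses `MODEL_NAME`, `prompt`, `image`, and `sampling_params` built exactly as in the script above):

```python
import asyncio

from vllm import AsyncEngineArgs, AsyncLLMEngine

# Assumption: MODEL_NAME, prompt, image, and sampling_params are
# constructed exactly as in the synchronous script above.
engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(
        model=MODEL_NAME,
        trust_remote_code=True,
        gpu_memory_utilization=1,
        max_model_len=2048,
        tensor_parallel_size=2,
    )
)

async def run() -> None:
    inputs = {
        "prompt": prompt,
        "multi_modal_data": {"image": image},
    }
    final_output = None
    # generate() yields incremental RequestOutputs; keep the last one
    async for request_output in engine.generate(
        inputs, sampling_params, request_id="0"
    ):
        final_output = request_output
    print(final_output.outputs[0].text)

asyncio.run(run())
```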
### Expected Behavior
_No response_
### Steps To Reproduce
_No response_
### Environment
```Markdown
- OS: CentOS
- Python: 3.10
- Transformers: 4.44.0
- PyTorch: 2.4.0
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`):
- vLLM: 0.5.4
```
### Anything else?
_No response_