Guancheng Fu
It seems that, under the same conditions, v4.38.0 is correct. I guess something is missing in our `native_sdp`. Update: the `attention_mask` is always None during the calculation. I guess this is...
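For reference, here is a minimal sketch (not the actual `native_sdp` code) of how a scaled-dot-product-attention path commonly handles a missing mask, using PyTorch's `torch.nn.functional.scaled_dot_product_attention`; the shapes are just for illustration:

```python
# Minimal sketch: handling attention_mask=None in an SDP path (illustrative only).
import torch
import torch.nn.functional as F

def sdp_attention(query, key, value, attention_mask=None):
    # When no mask is supplied, fall back to a causal mask so decoding
    # still attends only to previous positions.
    if attention_mask is None:
        return F.scaled_dot_product_attention(query, key, value, is_causal=True)
    # Otherwise pass the explicit (e.g. padding + causal) mask through.
    return F.scaled_dot_product_attention(query, key, value, attn_mask=attention_mask)

# Example shapes: (batch, heads, seq_len, head_dim)
q = torch.randn(1, 8, 16, 64)
k = torch.randn(1, 8, 16, 64)
v = torch.randn(1, 8, 16, 64)
out = sdp_attention(q, k, v)  # uses the causal fallback
print(out.shape)              # torch.Size([1, 8, 16, 64])
```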
> I tested on transformers v4.38.0; when optimize_model=True, it cannot work because transformers 4.38+ adds a `cache_position` parameter to forward. The error is as below: Traceback (most recent call last):...
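For context, one common workaround (a hypothetical sketch, not the fix used in this repo) is to wrap a forward written for older transformers so that it silently accepts and drops the new `cache_position` keyword:

```python
# Hypothetical sketch: tolerate the `cache_position` kwarg added in transformers 4.38+.
import functools

def make_compatible(old_forward):
    """Wrap a forward written for transformers<4.38 so it accepts
    (and drops) the `cache_position` keyword passed by newer releases."""
    @functools.wraps(old_forward)
    def wrapper(*args, cache_position=None, **kwargs):
        # cache_position is ignored here; the original forward keeps its own KV cache logic.
        return old_forward(*args, **kwargs)
    return wrapper

# Usage (hypothetical): patch a layer's attention forward in place, e.g.
#   attn = model.model.layers[0].self_attn
#   attn.forward = make_compatible(attn.forward)
```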
Closed as completed~ Thanks @jenniew
Hi, I am working to reproduce this issue.
Can you post the result of `offline_inference.py` within your old environment? We recently fixed a bug that may cause the generation to end early. So if the generation ends early...
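If it helps, one way to see whether outputs are being cut off early is to rerun the offline example with explicit sampling limits and check each request's finish reason; a minimal sketch using the vLLM Python API (the model path is a placeholder):

```python
# Minimal sketch for checking whether generation stops early.
from vllm import LLM, SamplingParams

prompts = ["The capital of France is", "Explain what PagedAttention does."]
# finish_reason tells you whether a request stopped at EOS ("stop") or hit the
# token limit ("length"); set ignore_eos=True to force full-length output instead.
sampling_params = SamplingParams(temperature=0.0, max_tokens=128)

llm = LLM(model="/path/to/your/model")  # placeholder path
outputs = llm.generate(prompts, sampling_params)

for out in outputs:
    gen = out.outputs[0]
    print(f"prompt: {out.prompt!r}")
    print(f"generated {len(gen.token_ids)} tokens, finish_reason={gen.finish_reason}")
    print(gen.text)
```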
Can you check if your old environment's vLLM has the following code: https://github.com/analytics-zoo/vllm/blob/sycl_xpu/vllm/worker/model_runner.py#L216 Also, you can try benchmark_throughput to get a more accurate performance estimate. Try following the instructions here:...
The `offline_inference.py` script is not designed for performance benchmarking. If you want end-to-end latency or requests per second, you should start the service according to this...
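As a rough illustration of the client side, once a service is up you can time full round trips yourself; a sketch assuming an OpenAI-compatible completions endpoint on localhost:8000 (the URL, model name, and payload are placeholders to adapt to your deployment):

```python
# Rough client-side sketch for end-to-end latency / requests-per-second measurement.
import time
import requests

URL = "http://localhost:8000/v1/completions"   # assumed endpoint
payload = {
    "model": "your-model-name",                # placeholder
    "prompt": "San Francisco is a",
    "max_tokens": 128,
    "temperature": 0.0,
}

latencies = []
for _ in range(10):
    start = time.perf_counter()
    resp = requests.post(URL, json=payload, timeout=300)
    resp.raise_for_status()
    latencies.append(time.perf_counter() - start)

print(f"mean end-to-end latency: {sum(latencies) / len(latencies):.3f} s")
print(f"rough request throughput: {len(latencies) / sum(latencies):.2f} req/s")
```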
Can you check whether this official benchmark script https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_throughput.py can be used? If not, can you post the Docker image name and tag so that I can see...
In this case, can you try the following script?

```python
"""Benchmark offline inference throughput."""
import argparse
import json
import random
import time
from typing import List, Optional, Tuple

import torch
...
```
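At its core, a throughput benchmark like this times one batched `generate()` call and divides the token count by wall-clock time; here is a stripped-down sketch of that idea (not the full script above; the model path and lengths are placeholders):

```python
# Stripped-down sketch of an offline throughput measurement with vLLM.
import time
from vllm import LLM, SamplingParams

num_prompts, input_len, output_len = 32, 128, 128

llm = LLM(model="/path/to/your/model")  # placeholder path
# Dummy prompts of roughly input_len words; the real benchmark samples a dataset.
prompts = ["hello " * input_len for _ in range(num_prompts)]
sampling_params = SamplingParams(temperature=0.0, max_tokens=output_len, ignore_eos=True)

start = time.perf_counter()
outputs = llm.generate(prompts, sampling_params)
elapsed = time.perf_counter() - start

total_out_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{num_prompts / elapsed:.2f} requests/s, "
      f"{total_out_tokens / elapsed:.2f} output tokens/s")
```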
Hi, the vLLM you used is deprecated and will not be supported anymore :cry: The old vLLM does not use PagedAttention and did not perform well enough in our tests....