Performance drop for neural-chat 7b with new repo of ipex-llm (2.5.0b20240425) vLLM serving

We have seen a significant performance drop for the neural-chat model under vLLM serving when the environment is built from the latest repo, compared with an environment built from the old repo. With the offline_inference.py script and the default prompts, the old env gives an inference time of 7-11 sec with GPU utilization of only 50%, while the new env gives 18-24 sec with GPU utilization of 100% on a Flex 170. I also tried the docker env, but it gives an inference time of 18-23 sec as well. The env details are given below:

Old env:
accelerate 0.21.0
annotated-types 0.6.0
anyio 4.3.0
bigdl-core-xe-21 2.5.0b20240402
bigdl-core-xe-esimd-21 2.5.0b20240402
certifi 2024.2.2
charset-normalizer 3.3.2
click 8.1.7
exceptiongroup 1.2.0
fastapi 0.110.1
filelock 3.13.3
fsspec 2024.3.1
h11 0.14.0
httptools 0.6.1
huggingface-hub 0.17.3
idna 3.6
intel-extension-for-pytorch 2.1.10+xpu
intel-openmp 2024.1.0
ipex-llm 2.1.0b20240402
Jinja2 3.1.3
MarkupSafe 2.1.5
mpmath 1.3.0
networkx 3.2.1
numpy 1.26.4
packaging 24.0
pillow 10.3.0
pip 23.3.1
protobuf 5.26.1
psutil 5.9.8
py-cpuinfo 9.0.0
pydantic 1.10.15
pydantic_core 2.18.0
python-dotenv 1.0.1
PyYAML 6.0.1
regex 2023.12.25
requests 2.31.0
safetensors 0.4.2
sentencepiece 0.2.0
setuptools 68.2.2
sniffio 1.3.1
starlette 0.37.2
sympy 1.12.1rc1
tabulate 0.9.0
tokenizers 0.14.1
torch 2.1.0a0+cxx11.abi
torchvision 0.16.0a0+cxx11.abi
tqdm 4.66.2
transformers 4.34.0
typing_extensions 4.11.0rc1
urllib3 2.2.1
uvicorn 0.29.0
uvloop 0.19.0
watchfiles 0.21.0
websockets 12.0
wheel 0.41.2

New env:
accelerate 0.21.0
aiosignal 1.3.1
annotated-types 0.6.0
anyio 4.3.0
attrs 23.2.0
bigdl-core-xe-21 2.5.0b20240425
bigdl-core-xe-esimd-21 2.5.0b20240425
certifi 2024.2.2
charset-normalizer 3.3.2
click 8.1.7
cloudpickle 3.0.0
cmake 3.29.2
diskcache 5.6.3
einops 0.7.0
fastapi 0.110.1
filelock 3.13.4
frozenlist 1.4.1
fsspec 2024.3.1
h11 0.14.0
httptools 0.6.1
huggingface-hub 0.22.2
idna 3.7
intel-extension-for-pytorch 2.1.10+xpu
intel-openmp 2024.1.0
interegular 0.3.3
ipex-llm 2.1.0b20240425
Jinja2 3.1.3
joblib 1.4.0
jsonschema 4.21.1
jsonschema-specifications 2023.12.1
lark 1.1.9
llvmlite 0.42.0
MarkupSafe 2.1.5
mpmath 1.3.0
msgpack 1.0.8
nest-asyncio 1.6.0
networkx 3.3
ninja 1.11.1.1
numba 0.59.1
numpy 1.26.4
oneccl-bind-pt 2.1.100+xpu
outlines 0.0.34
packaging 24.0
pandas 2.2.2
pillow 10.3.0
pip 23.3.1
prometheus_client 0.20.0
protobuf 5.26.1
psutil 5.9.8
py-cpuinfo 9.0.0
pyarrow 16.0.0
pydantic 2.7.1
pydantic_core 2.18.2
pynvml 11.5.0
python-dateutil 2.9.0.post0
python-dotenv 1.0.1
pytz 2024.1
PyYAML 6.0.1
ray 2.12.0
referencing 0.35.0
regex 2023.12.25
requests 2.31.0
rpds-py 0.18.0
safetensors 0.4.3
scipy 1.13.0
sentencepiece 0.2.0
setuptools 68.2.2
six 1.16.0
sniffio 1.3.1
starlette 0.37.2
sympy 1.12.1rc1
tabulate 0.9.0
tiktoken 0.6.0
tokenizers 0.19.1
torch 2.1.0a0+cxx11.abi
torchvision 0.16.0a0+cxx11.abi
tqdm 4.66.2
transformers 4.40.1
transformers-stream-generator 0.0.5
triton 2.1.0
typing_extensions 4.11.0
tzdata 2024.1
urllib3 2.2.1
uvicorn 0.29.0
uvloop 0.19.0
vllm 0.3.3+xpu0.0.1 /root/vllm
watchfiles 0.21.0
websockets 12.0
wheel 0.41.2
xformers 0.0.25.post1
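Two package lists of this size are hard to compare by eye. The following is a minimal sketch of a version diff, assuming each environment's `pip list` output is available as text with one package per line. The helper names `parse_pip_list` and `diff_envs` are made up for this illustration, and only a handful of entries from the two lists are used as sample input.

```python
def parse_pip_list(text):
    """Parse 'name version [location]' lines into a {name: version} dict."""
    packages = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue  # blank or malformed line
        if parts[0] == "Package" or parts[0].startswith("-"):
            continue  # header rows printed by `pip list`
        packages[parts[0].lower()] = parts[1]
    return packages


def diff_envs(old, new):
    """Return (changed, added, removed) between two {name: version} dicts."""
    changed = {n: (old[n], new[n])
               for n in old.keys() & new.keys() if old[n] != new[n]}
    added = {n: new[n] for n in new.keys() - old.keys()}
    removed = {n: old[n] for n in old.keys() - new.keys()}
    return changed, added, removed


# A few entries copied from the old and new env lists; the full lists
# can be pasted in the same way.
old_env = parse_pip_list("""
ipex-llm 2.1.0b20240402
transformers 4.34.0
tokenizers 0.14.1
pydantic 1.10.15
torch 2.1.0a0+cxx11.abi
exceptiongroup 1.2.0
""")
new_env = parse_pip_list("""
ipex-llm 2.1.0b20240425
transformers 4.40.1
tokenizers 0.19.1
pydantic 2.7.1
torch 2.1.0a0+cxx11.abi
vllm 0.3.3+xpu0.0.1 /root/vllm
""")

changed, added, removed = diff_envs(old_env, new_env)
for name in sorted(changed):
    print(f"{name}: {changed[name][0]} -> {changed[name][1]}")
print("added:", sorted(added))
print("removed:", sorted(removed))
```

A diff like this only narrows down which packages moved between the two environments; it does not by itself show which change causes the slowdown.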

Also, the new env gives the error below with the bfloat16 datatype. Is bfloat16 no longer supported?

May 03 '24 07:05 Vasud-ha

I don't know if it's related, but I also noticed a drop in performance running Llama 3 under Ollama recently (so using the IPEX llama.cpp implementation). I was originally getting ~50t/s on inference, ran a rebuild (still with the intelanalytics/ipex-llm-xpu:latest image as a base) and suddenly it dropped to ~30t/s. I couldn't find any combination of host drivers or OneAPI packages that would get the performance back.

At this point, it's difficult to justify sticking with the Intel platform - my old RX 6600 XT is 30% faster than the A770 is now!

May 05 '24 00:05 digitalscream

Hi, I am working to reproduce this issue.

May 06 '24 02:05 gc-fu

Can you post the result of the offline_inference.py within your old environment?

We recently fixed a bug that could cause the generation to end early. So if the generation ends early with weird output, the inference will be quicker.
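One way to make two runs comparable in this situation is to divide by the number of generated tokens rather than comparing wall-clock time alone. The sketch below illustrates the arithmetic only: the function name `tokens_per_second` is made up, and the token counts and timings are hypothetical values, not measurements from this issue.

```python
def tokens_per_second(generated_token_counts, elapsed_s):
    """Generated tokens per second for one batch of completions."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return sum(generated_token_counts) / elapsed_s


# Hypothetical numbers, for illustration only (not measured in this thread):
# a run that stops early finishes sooner but also produces fewer tokens.
early_stop_run = tokens_per_second([40, 35, 50, 45], elapsed_s=9.0)
full_length_run = tokens_per_second([128, 128, 128, 128], elapsed_s=20.0)

print(f"early stop:  {early_stop_run:.1f} tokens/s")
print(f"full length: {full_length_run:.1f} tokens/s")
```

With these made-up inputs the run that takes longer overall still has the higher per-token rate, which is why the generated text and its length matter when comparing the two environments.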

May 06 '24 02:05 gc-fu

Hi @digitalscream , based on our local test, Llama3 could get ~50 tokens/s on a single A770.

I was originally getting ~50t/s on inference, ran a rebuild (still with the intelanalytics/ipex-llm-xpu:latest image as a base) and suddenly it dropped to ~30t/s.

I wonder whether anything changed during this rebuild process? This performance degradation looks more like a driver-related issue.

May 06 '24 05:05 rnwang04

Can you check whether your old environment's vLLM has the following code: https://github.com/analytics-zoo/vllm/blob/sycl_xpu/vllm/worker/model_runner.py#L216

Also, you can try benchmark_throughput to get a more accurate performance estimate. Try following the instructions here: https://github.com/intel-analytics/ipex-llm/tree/main/docker/llm/serving/xpu/docker#offline-benchmark-through-benchmark_throughputpy

The benchmark_throughput.py can be acquired here

May 06 '24 07:05 gc-fu

Sorry, should've mentioned - that's using a single A770, the only change in the Docker image was that it pulled the intelanalytics/ipex-llm-xpu:latest base image again. I do have some new information, though - seems like it's heavily CPU-bound now, where it wasn't before. After a bit of experimentation, I get 30t/s on my Ryzen 3600 machine, and 41t/s on my i5-13600 machine. Originally, I was getting ~50t/s on the Ryzen machine.

May 06 '24 07:05 digitalscream

Can you post the result of the offline_inference.py within your old environment?

We recently fixed a bug that could cause the generation to end early. So if the generation ends early with weird output, the inference will be quicker.

May 06 '24 07:05 Vasud-ha

Yes, my old environment's vLLM has the code mentioned here. I will try the benchmark_throughput.py script. Thanks.

May 06 '24 08:05 Vasud-ha

Thanks for providing more information! When you updated the Docker image, did you also update the version of ipex-llm[cpp] and the ollama binary file you used? We have a conjecture about the performance degradation on the Ryzen 3600: we suspect this issue is related to one of our previous PRs, which affected a certain function on the CPU. We have already reverted that PR. Perhaps you can try our latest release tomorrow (pip install --pre --upgrade ipex-llm[cpp], and don't forget to run init-ollama again) to see whether this issue is resolved?

May 06 '24 09:05 rnwang04

Ah, OK - yes, I updated everything when I rebuilt it from the base image. Is there another issue regarding the Ryzen performance? Don't want to pollute this one if there's a more appropriate place to discuss it.

May 06 '24 09:05 digitalscream

Hi @gc-fu, I ran offline_inference.py with the latest code and am still getting 17 sec of latency. I also tested the benchmark_throughput.py script. Could you suggest how to get the end-to-end inference latency?

May 07 '24 10:05 Vasud-ha

The offline_inference.py script is not designed for performance benchmarking.

If you want to get end-to-end latency or requests per second, you should start the service according to this readme. Then you can send requests to the service using benchmark tools like wrk or jmeter.
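For a rough first number before setting up wrk or jmeter, end-to-end latency can also be timed from a small client script. The sketch below is only an illustration: `measure_latencies`, `summarize`, and `fake_request` are hypothetical names, requests are sent one at a time (unlike a real load tool, which sends them concurrently), and `fake_request` is a stand-in to be replaced with an actual HTTP call to the running service.

```python
import statistics
import time


def measure_latencies(send_request, num_requests=20):
    """Call send_request() repeatedly; return per-request latency in seconds."""
    latencies = []
    for _ in range(num_requests):
        start = time.perf_counter()
        send_request()
        latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies):
    """Mean, median and max latency, plus requests/s for sequential sends."""
    total = sum(latencies)
    return {
        "mean_s": statistics.mean(latencies),
        "median_s": statistics.median(latencies),
        "max_s": max(latencies),
        "requests_per_s": len(latencies) / total if total > 0 else float("inf"),
    }


def fake_request():
    # Placeholder standing in for a real HTTP call to the running service.
    time.sleep(0.01)


stats = summarize(measure_latencies(fake_request, num_requests=5))
print(stats)
```

Because the requests are sequential, the requests-per-second value here is a lower bound on what the service can sustain under concurrent load.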

The result of benchmark_throughput.py should give you some insight into tokens per second, which should indicate the performance of different versions.

Could you please post the result of the benchmark_throughput script for the old and new environments?

May 07 '24 12:05 gc-fu

Thanks @gc-fu. With the docker env, benchmark_throughput.py gives a throughput of 489.68 tokens/sec for 1000 prompts (default settings). However, this script is not available in the docker directory of the old repo.
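When reading a tokens/sec figure like this, it helps to know what is being counted. The sketch below assumes the figure is computed the way the benchmark script posted later in this thread computes it, that is, prompt tokens plus output tokens divided by elapsed time. The function name `throughput` and the workload numbers are hypothetical and are not the settings behind the 489.68 result.

```python
def throughput(requests, elapsed_s):
    """requests: list of (prompt_len, output_len) token counts per request.

    Returns (requests per second, total tokens per second).
    """
    total_tokens = sum(p + o for p, o in requests)
    return len(requests) / elapsed_s, total_tokens / elapsed_s


# Hypothetical workload: 1000 requests, 512 prompt + 512 output tokens, 100 s.
workload = [(512, 512)] * 1000
req_per_s, tok_per_s = throughput(workload, elapsed_s=100.0)

# Counting only generated tokens gives a smaller number for the same run.
generated_per_s = sum(o for _, o in workload) / 100.0

print(f"{req_per_s:.4f} requests/s, {tok_per_s:.2f} tokens/s "
      f"({generated_per_s:.2f} generated tokens/s)")
```

Under that assumption, a tokens/sec value from the throughput script is not directly comparable with a generation-only rate such as the t/s numbers reported by other tools.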

May 07 '24 13:05 Vasud-ha

Can you check whether this official benchmark script https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_throughput.py can be used?

If not, can you post the docker image name and tag so that I can try to find a proper script for you :smiley: It would also be most helpful if you could post the entire offline_inference.py script from your old environment. It has been a long time since we posted the old docker image :cry:

May 08 '24 01:05 gc-fu

I tried running the script in the old repo but am facing import issues.

This is the docker image: ipex-llm-serving-xpu:2.1.0-SNAPSHOT

This is the offline_inference.py script:

from ipex_llm.vllm.entrypoints.llm import LLM
from ipex_llm.vllm.sampling_params import SamplingParams
import time

# Default prompts, generated as a single batch.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# Load the model with sym_int4 low-bit weights on the Intel GPU ("xpu").
llm = LLM(model="/root/neural-chat-7b-v3/", load_in_low_bit="sym_int4", dtype="bfloat16", device="xpu")

# Time only the generate() call; model loading above is excluded.
st_time = time.time()
outputs = llm.generate(prompts, sampling_params)
en_time = time.time()
print(f'Inference time: {en_time-st_time} s')

# Print each prompt with its generated text.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

May 08 '24 06:05 Vasud-ha

In this case, can you try the following script?

"""Benchmark offline inference throughput."""
import argparse
import json
import random
import time
from typing import List, Optional, Tuple

import torch
from ipex_llm.transformers import AutoModelForCausalLM
# from transformers import AutoModelForCausalLM

from transformers import PreTrainedTokenizerBase
from tqdm import tqdm

#from vllm import LLM, SamplingParams
from ipex_llm.vllm.entrypoints.llm import LLM
from ipex_llm.vllm.sampling_params import SamplingParams
#from vllm.transformers_utils.tokenizer import get_tokenizer
from ipex_llm.vllm.transformers_utils.tokenizer import get_tokenizer

device = 'xpu'
if device == 'xpu':
    import intel_extension_for_pytorch as ipex


def sample_requests(
    dataset_path: str,
    num_requests: int,
    tokenizer: PreTrainedTokenizerBase,
    fixed_output_len: Optional[int],
) -> List[Tuple[str, int, int]]:
    if fixed_output_len is not None and fixed_output_len < 4:
        raise ValueError("output_len too small")

    # Load the dataset.
    with open(dataset_path) as f:
        dataset = json.load(f)
    # Filter out the conversations with less than 2 turns.
    dataset = [data for data in dataset if len(data["conversations"]) >= 2]
    # Only keep the first two turns of each conversation.
    dataset = [(data["conversations"][0]["value"],
                data["conversations"][1]["value"]) for data in dataset]

    # Tokenize the prompts and completions.
    prompts = [prompt for prompt, _ in dataset]
    prompt_token_ids = tokenizer(prompts).input_ids
    completions = [completion for _, completion in dataset]
    completion_token_ids = tokenizer(completions).input_ids
    tokenized_dataset = []
    for i in range(len(dataset)):
        output_len = len(completion_token_ids[i])
        if fixed_output_len is not None:
            output_len = fixed_output_len
        tokenized_dataset.append((prompts[i], prompt_token_ids[i], output_len))

    # Filter out too long sequences.
    filtered_dataset: List[Tuple[str, int, int]] = []
    for prompt, prompt_token_ids, output_len in tokenized_dataset:
        prompt_len = len(prompt_token_ids)
        if prompt_len < 4 or output_len < 4:
            # Prune too short sequences.
            continue
        if prompt_len > 1024 or prompt_len + output_len > 2048:
            # Prune too long sequences.
            continue
        filtered_dataset.append((prompt, prompt_len, output_len))

    # Sample the requests.
    sampled_requests = random.sample(filtered_dataset, num_requests)
    return sampled_requests


def run_vllm(
    requests: List[Tuple[str, int, int]],
    model: str,
    tokenizer: str,
    quantization: Optional[str],
    tensor_parallel_size: int,
    seed: int,
    n: int,
    use_beam_search: bool,
    trust_remote_code: bool,
    dtype: str,
    max_num_seqs: int,
) -> float:
    llm = LLM(
        model=model,
        tokenizer=tokenizer,
        quantization=quantization,
        #tensor_parallel_size=tensor_parallel_size,
        seed=seed,
        trust_remote_code=trust_remote_code,
        dtype=dtype,
        # change here
        device=device,
        max_num_batched_tokens=204800,
        max_model_len=2048,
        max_num_seqs=max_num_seqs,
    )
    warm_prompt = "hi " * (1024 - 1)
    warm_requests = [(warm_prompt, 1024, 1024)
                    for _ in range(1)]
    for prompt, _, output_len in warm_requests:
        sampling_params = SamplingParams(
            n=n,
            temperature=0.0 if use_beam_search else 1.0,
            top_p=1.0,
            use_beam_search=use_beam_search,
            ignore_eos=True,
            max_tokens=output_len,
        )
        llm._add_request(
            prompt=prompt,
            prompt_token_ids=None,
            sampling_params=sampling_params,
        )
    llm._run_engine(use_tqdm=True)

    # Add the requests to the engine.
    for prompt, _, output_len in requests:
        sampling_params = SamplingParams(
            n=n,
            temperature=0.0 if use_beam_search else 1.0,
            top_p=1.0,
            use_beam_search=use_beam_search,
            ignore_eos=True,
            max_tokens=output_len,
        )
        # FIXME(woosuk): Do not use internal method.
        llm._add_request(
            prompt=prompt,
            prompt_token_ids=None,
            sampling_params=sampling_params,
        )

    start = time.perf_counter()
    # FIXME(woosuk): Do not use internal method.
    llm._run_engine(use_tqdm=True)
    end = time.perf_counter()
    return end - start


def run_hf(
    requests: List[Tuple[str, int, int]],
    model: str,
    tokenizer: PreTrainedTokenizerBase,
    n: int,
    use_beam_search: bool,
    max_batch_size: int,
    trust_remote_code: bool,
) -> float:
    assert not use_beam_search
    llm = AutoModelForCausalLM.from_pretrained(
        model,
        load_in_4bit=True,
        optimize_model=True,
        trust_remote_code=True,
        use_cache=True,
    )
    # llm = AutoModelForCausalLM.from_pretrained(
    #     model, trust_remote_code=True, use_cache=True, torch_dtype=torch.bfloat16,
    # )

    tokenizer.pad_token = tokenizer.eos_token
    if device == 'xpu':
        llm = llm.to('xpu')

    # warmup
    warm_prompt = "hi " * (1000 - 1)
    input_ids = tokenizer(warm_prompt, return_tensors="pt",
                              padding=True).input_ids

    if device == 'xpu':
        input_ids = input_ids.to('xpu')
    _ = llm.generate(
            input_ids=input_ids,
            do_sample=False,
            num_return_sequences=n,
            num_beams=1,
            temperature=1.0,
            top_p=1.0,
            use_cache=True,
            max_new_tokens=1024,
            pad_token_id=tokenizer.pad_token_id,
        )

    pbar = tqdm(total=len(requests))
    start = time.perf_counter()
    batch: List[str] = []
    max_prompt_len = 0
    max_output_len = 0
    for i in range(len(requests)):
        prompt, prompt_len, output_len = requests[i]
        # Add the prompt to the batch.
        batch.append(prompt)
        max_prompt_len = max(max_prompt_len, prompt_len)
        max_output_len = max(max_output_len, output_len)
        if len(batch) < max_batch_size and i != len(requests) - 1:
            # Check if we can add more requests to the batch.
            _, next_prompt_len, next_output_len = requests[i + 1]
            if (max(max_prompt_len, next_prompt_len) +
                    max(max_output_len, next_output_len)) <= 2048:
                # We can add more requests to the batch.
                continue

        # Generate the sequences.
        # print(batch)
        input_ids = tokenizer(batch, return_tensors="pt",
                              padding=True).input_ids
        if device == 'xpu':
            input_ids = input_ids.to('xpu')
        llm_outputs = llm.generate(
            input_ids=input_ids,
            do_sample=False,
            num_return_sequences=n,
            num_beams=1,
            temperature=1.0,
            top_p=1.0,
            use_cache=True,
            max_new_tokens=max_output_len,
            pad_token_id=tokenizer.pad_token_id,
        )
        # Include the decoding time.
        tokenizer.batch_decode(llm_outputs, skip_special_tokens=True)
        pbar.update(len(batch))

        # Clear the batch.
        batch = []
        max_prompt_len = 0
        max_output_len = 0
    end = time.perf_counter()
    return end - start


def main(args: argparse.Namespace):
    print(args)
    random.seed(args.seed)

    # Sample the requests.
    tokenizer = get_tokenizer(args.tokenizer,
                              unk_token="<unk>",
                              trust_remote_code=args.trust_remote_code)
    if args.dataset is None:
        # Synthesize a prompt with the given input length.
        prompt = "hi " * (args.input_len - 1)
        requests = [(prompt, args.input_len, args.output_len)
                    for _ in range(args.num_prompts)]
    else:
        requests = sample_requests(args.dataset, args.num_prompts, tokenizer,
                                   args.output_len)

    if args.backend == "vllm":
        elapsed_time = run_vllm(requests, args.model, args.tokenizer,
                                args.quantization, args.tensor_parallel_size,
                                args.seed, args.n, args.use_beam_search,
                                args.trust_remote_code, args.dtype, args.max_num_seqs)
    elif args.backend == "hf":
        assert args.tensor_parallel_size == 1
        elapsed_time = run_hf(requests, args.model, tokenizer, args.n,
                              args.use_beam_search, args.hf_max_batch_size,
                              args.trust_remote_code)
    else:
        raise ValueError(f"Unknown backend: {args.backend}")
    total_num_tokens = sum(prompt_len + output_len
                           for _, prompt_len, output_len in requests)
    print(f"Throughput: {len(requests) / elapsed_time:.4f} requests/s, "
          f"{total_num_tokens / elapsed_time:.2f} tokens/s")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Benchmark the throughput.")
    parser.add_argument("--backend",
                        type=str,
                        choices=["vllm", "hf"],
                        default="vllm")
    parser.add_argument("--dataset",
                        type=str,
                        default=None,
                        help="Path to the dataset.")
    parser.add_argument("--input-len",
                        type=int,
                        default=None,
                        help="Input prompt length for each request")
    parser.add_argument("--output-len",
                        type=int,
                        default=None,
                        help="Output length for each request. Overrides the "
                        "output length from the dataset.")
    parser.add_argument("--model", type=str, default="facebook/opt-125m")
    parser.add_argument("--tokenizer", type=str, default=None)
    parser.add_argument('--quantization',
                        '-q',
                        choices=['awq', None],
                        default=None)
    parser.add_argument("--tensor-parallel-size", "-tp", type=int, default=1)
    parser.add_argument("--n",
                        type=int,
                        default=1,
                        help="Number of generated sequences per prompt.")
    parser.add_argument("--use-beam-search", action="store_true")
    parser.add_argument("--num-prompts",
                        type=int,
                        default=1000,
                        help="Number of prompts to process.")
    parser.add_argument("--seed", type=int, default=42)
    parser.add_argument("--hf-max-batch-size",
                        type=int,
                        default=None,
                        help="Maximum batch size for HF backend.")
    parser.add_argument("--max-num-seqs", type=int, default=8)
    parser.add_argument('--trust-remote-code',
                        action='store_true',
                        help='trust remote code from huggingface')
    parser.add_argument(
        '--dtype',
        type=str,
        default='auto',
        choices=['auto', 'half', 'float16', 'bfloat16', 'float', 'float32'],
        help='data type for model weights and activations. '
        'The "auto" option will use FP16 precision '
        'for FP32 and FP16 models, and BF16 precision '
        'for BF16 models.')
    args = parser.parse_args()

    if args.backend == "vllm":
        if args.hf_max_batch_size is not None:
            raise ValueError("HF max batch size is only for HF backend.")
    elif args.backend == "hf":
        if args.hf_max_batch_size is None:
            raise ValueError("HF max batch size is required for HF backend.")
        if args.quantization is not None:
            raise ValueError("Quantization is only for vLLM backend.")
    if args.tokenizer is None:
        args.tokenizer = args.model
    
    if args.dataset is None:
        assert args.input_len is not None
        assert args.output_len is not None
    else:
        assert args.input_len is None

    main(args)

Also, can you run the following command in your docker environment and post the result?

find / -name "bigdl_mistral.py"

If you find the file in your environment, then you are using a quite old version of vLLM that does not fully implement the PagedAttention algorithm.
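If the `find` command is not convenient, for example when checking from inside a Python session in the container, the same search can be sketched with the standard library. The helper name `find_file` is made up for this illustration, and the demo runs against a throwaway directory rather than the real filesystem.

```python
import os
import tempfile


def find_file(root, filename):
    """Return every path under root whose basename equals filename."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if filename in filenames:
            matches.append(os.path.join(dirpath, filename))
    return matches


# Self-check on a temporary directory tree.
with tempfile.TemporaryDirectory() as tmp:
    nested = os.path.join(tmp, "a", "b")
    os.makedirs(nested)
    open(os.path.join(nested, "bigdl_mistral.py"), "w").close()
    found = find_file(tmp, "bigdl_mistral.py")
    print(len(found), "match(es) found")
```

Calling `find_file("/", "bigdl_mistral.py")` inside the container would mirror the command above; an empty list means the file is not present.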

May 08 '24 15:05 gc-fu

I faced this error while trying out the above code. Could you suggest how to resolve it?

May 10 '24 07:05 Vasud-ha

Hi, the vLLM you used is deprecated and will not be supported anymore :cry:

The old vLLM does not use PagedAttention and did not perform well enough in our tests. Besides, the old vLLM suffered from out-of-memory issues in GPU environments.

Try using the latest vLLM instead. I am pretty sure the new vLLM is quicker than the old one.

May 11 '24 02:05 gc-fu

Hi @gc-fu, I couldn't locate the benchmark_throughput.py file inside docker. Could you share the path? This is the docker image I built. Earlier I was able to locate the file, but I rebuilt the image and now I can't find it.

May 14 '24 11:05 Vasud-ha

The image you should use is this one: intelanalytics/ipex-llm-serving-xpu

Try checking the README here: https://github.com/intel-analytics/ipex-llm/tree/main/docker/llm/serving/xpu/docker

May 14 '24 12:05 gc-fu

Can we remove the deprecated docker image?

May 14 '24 14:05 jason-dai

Yes, we can. There are basically two images related to vLLM:

vLLM-CPU: ipex-llm-serving-cpu. Since vLLM-v1 has been removed from our codebase, this image no longer contains any code related to vLLM.

vLLM-XPU: ipex-llm-serving-xpu. This is the only available image that users should use for now.

At the beginning of this issue, the user was using an ipex-llm-serving-xpu image that was built long ago and contains the deprecated vLLM-v1 code. If the user pulls the image again, the old code will disappear.

I will remove the vLLM-CPU example page later.

May 14 '24 14:05 gc-fu
