Mike Yang
### Describe the bug
Run the test script `intel-extension-for-pytorch/examples/gpu/inference/python/llm/run_benchmark.sh`. It fails with the following error:
Namespace(model_id='/home/llm/disk/llm/meta-llama/Llama-2-7b-hf', sub_model_name='llama2-7b', device='xpu', dtype='float16', input_tokens='1024', max_new_tokens=128, prompt=None, greedy=False, ipex=True, jit=False, profile=False, benchmark=True, lambada=False, dataset='lambada', num_beams=4,...
Affected scripts:
- python/llm/example/GPU/Deepspeed-AutoTP/run_qwen_14b_arc_2_card.sh
- python/llm/example/GPU/Deepspeed-AutoTP/run_vicuna_33b_arc_2_card.sh
- python/llm/dev/benchmark/all-in-one/run-deepspeed-arc.sh

Currently the following code enables SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS only on Intel Core CPUs, but it also needs to be enabled on Intel Xeon CPUs to improve performance (see the sketch after this excerpt).
```
if grep...
```
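The truncated `if grep...` check above suggests the export is gated on the detected CPU model. A minimal sketch of what the extended check might look like, assuming the CPU is detected via `lscpu` and that matching "Xeon" in addition to "Core" is the intended change (the exact patterns and variable names are illustrative, not taken from the actual scripts):

```
#!/bin/bash
# Hypothetical sketch: export immediate command lists on both Core and Xeon.
# The grep patterns below are assumptions about the existing check.
cpu_model=$(lscpu | grep "Model name")

if echo "$cpu_model" | grep -qE "Core|Xeon"; then
    # Improves Level Zero command submission latency on client and server CPUs alike.
    export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
fi
```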
The current all-in-one benchmark saves the CSV file with a name that contains only the date. If we run multiple tests on the same day, the older test data will be overwritten...
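A straightforward workaround is to include the time of day in the output name so that runs within the same day stay distinct. A minimal sketch, assuming the CSV name can simply be composed from a timestamp (the file-name pattern is an illustration, not the actual all-in-one naming scheme):

```
#!/bin/bash
# Hypothetical naming scheme: add hours/minutes/seconds so that runs
# started on the same day do not overwrite each other.
timestamp=$(date +%Y-%m-%d-%H%M%S)
csv_name="benchmark-${timestamp}.csv"   # e.g. benchmark-2024-09-03-142501.csv
echo "Saving results to ${csv_name}"
```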
Every time I run the test, it loads the original model and converts it to lower bit. If we load a 34B model on 4 ARC cards, it will...
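One way to avoid repeating the conversion on every run is to convert once, save the low-bit weights, and load those directly afterwards. A rough sketch using ipex-llm's `save_low_bit` / `load_low_bit` helpers, driven from a bash heredoc; the model paths are placeholders and the exact arguments are assumptions for illustration:

```
#!/bin/bash
# Hypothetical one-time conversion: save the 4-bit weights to disk, so that
# later runs can call AutoModelForCausalLM.load_low_bit() on the saved path
# instead of re-converting the original checkpoint.
python - <<'EOF'
from ipex_llm.transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "/llm/models/original-34b",      # placeholder path to the original model
    load_in_4bit=True,
    trust_remote_code=True,
)
model.save_low_bit("/llm/models/original-34b-int4")  # placeholder output path
EOF
```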
(ipex-llm-0812) llm@GPU-Xeon4410Y-ARC770:~/ipex-llm-0812/python/llm/dev/benchmark/all-in-one$ bash run-deepspeed-arc.sh
:: initializing oneAPI environment ...
   run-deepspeed-arc.sh: BASH_VERSION = 5.1.16(1)-release
   args: Using "$@" for oneapi-vars.sh arguments: --force
:: advisor -- processing etc/advisor/vars.sh
:: ccl -- processing etc/ccl/vars.sh...
With the ipex-llm docker container `intelanalytics/ipex-llm-serving-vllm-xpu-experiment:2.1.0b2`, the model loads successfully on 4 ARC cards, but when loading the model on 8 ARC cards it fails with the following error.
root@GPU-Xeon4410Y-ARC770:/llm# bash start-vllm-service.sh
/usr/local/lib/python3.11/dist-packages/torchvision/io/image.py:13:...
### The vLLM docker image is
`intelanalytics/ipex-llm-serving-xpu-vllm-0.5.4-experimental:2.2.0b1`

### vLLM start command is
model="/llm/models/Qwen2-72B-Instruct/"
served_model_name="Qwen2-72B-Instruct"
source /opt/intel/1ccl-wks/setvars.sh
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=2
python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
  --served-model-name $served_model_name \
  --port 8000 \
  --model $model...
**The vLLM docker image is** `intelanalytics/ipex-llm-serving-xpu-vllm-0.5.4-experimental:2.2.0b1`

**vLLM start command is**
model="/llm/models/meta-llama/LLaMA-33B-HF/"
served_model_name="LLaMA-33B-HF"
source /opt/intel/1ccl-wks/setvars.sh
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=2
python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
  --served-model-name $served_model_name \
  --port 8000 \
  --model $model \
  --trust-remote-code...
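Both reports above launch the server the same way for large models spread across multiple ARC cards. For reference, a launch of this kind usually also specifies how many cards to shard across; a minimal sketch, assuming the ipex_llm entrypoint accepts vLLM's standard `--tensor-parallel-size` and `--gpu-memory-utilization` options (the values and paths below are illustrative, not the reporter's actual settings):

```
#!/bin/bash
# Hypothetical multi-card launch sketch; TP size, memory utilization and the
# model path are assumptions for illustration.
model="/llm/models/meta-llama/LLaMA-33B-HF/"
served_model_name="LLaMA-33B-HF"

source /opt/intel/1ccl-wks/setvars.sh
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=2

python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
  --served-model-name "$served_model_name" \
  --model "$model" \
  --port 8000 \
  --trust-remote-code \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.9
```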