Mike Yang
### Describe the bug
Run the test script `intel-extension-for-pytorch/examples/gpu/inference/python/llm/run_benchmark.sh`. It fails with the following error:
Namespace(model_id='/home/llm/disk/llm/meta-llama/Llama-2-7b-hf', sub_model_name='llama2-7b', device='xpu', dtype='float16', input_tokens='1024', max_new_tokens=128, prompt=None, greedy=False, ipex=True, jit=False, profile=False, benchmark=True, lambada=False, dataset='lambada', num_beams=4,...
Affected scripts:
- python/llm/example/GPU/Deepspeed-AutoTP/run_qwen_14b_arc_2_card.sh
- python/llm/example/GPU/Deepspeed-AutoTP/run_vicuna_33b_arc_2_card.sh
- python/llm/dev/benchmark/all-in-one/run-deepspeed-arc.sh

Currently the following code enables SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS only on Intel Core CPUs, but it also needs to be enabled on Intel Xeon CPUs to improve performance (see the sketch after this excerpt).
```
if grep...
```
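The truncated `if grep...` check above suggests the export is gated on the detected CPU model. A minimal sketch of what the extended check might look like, assuming the CPU is detected via `lscpu` and that matching "Xeon" in addition to "Core" is the intended change (the exact patterns and variable names are illustrative, not taken from the actual scripts):

```
#!/bin/bash
# Hypothetical sketch: export immediate command lists on both Core and Xeon.
# The grep patterns below are assumptions about the existing check.
cpu_model=$(lscpu | grep "Model name")

if echo "$cpu_model" | grep -qE "Core|Xeon"; then
    # Improves Level Zero command submission latency on client and server CPUs alike.
    export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
fi
```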
The current all-in-one benchmark saves the CSV file with a name that contains only the date. If we run multiple tests on the same day, the older test data will be overwritten...
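A straightforward workaround is to include the time of day in the output name so that runs within the same day stay distinct. A minimal sketch, assuming the CSV name can simply be composed from a timestamp (the file-name pattern is an illustration, not the actual all-in-one naming scheme):

```
#!/bin/bash
# Hypothetical naming scheme: add hours/minutes/seconds so that runs
# started on the same day do not overwrite each other.
timestamp=$(date +%Y-%m-%d-%H%M%S)
csv_name="benchmark-${timestamp}.csv"   # e.g. benchmark-2024-09-03-142501.csv
echo "Saving results to ${csv_name}"
```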
Every time I run the test, it loads the original model and converts it to lower bit. If we load a 34B model on 4 ARC cards, it will...
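One way to avoid repeating the conversion on every run is to convert once, save the low-bit weights, and load those directly afterwards. A rough sketch using ipex-llm's `save_low_bit` / `load_low_bit` helpers, driven from a bash heredoc; the model paths are placeholders and the exact arguments are assumptions for illustration:

```
#!/bin/bash
# Hypothetical one-time conversion: save the 4-bit weights to disk, so that
# later runs can call AutoModelForCausalLM.load_low_bit() on the saved path
# instead of re-converting the original checkpoint.
python - <<'EOF'
from ipex_llm.transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "/llm/models/original-34b",      # placeholder path to the original model
    load_in_4bit=True,
    trust_remote_code=True,
)
model.save_low_bit("/llm/models/original-34b-int4")  # placeholder output path
EOF
```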
(ipex-llm-0812) llm@GPU-Xeon4410Y-ARC770:~/ipex-llm-0812/python/llm/dev/benchmark/all-in-one$ bash run-deepspeed-arc.sh
:: initializing oneAPI environment ...
   run-deepspeed-arc.sh: BASH_VERSION = 5.1.16(1)-release
   args: Using "$@" for oneapi-vars.sh arguments: --force
:: advisor -- processing etc/advisor/vars.sh
:: ccl -- processing etc/ccl/vars.sh...
With the ipex-llm docker container `intelanalytics/ipex-llm-serving-vllm-xpu-experiment:2.1.0b2`, the model loads successfully on 4 ARC cards, but when loading the model on 8 ARC cards it fails with the following error.
root@GPU-Xeon4410Y-ARC770:/llm# bash start-vllm-service.sh
/usr/local/lib/python3.11/dist-packages/torchvision/io/image.py:13:...
### The vLLM docker image is
`intelanalytics/ipex-llm-serving-xpu-vllm-0.5.4-experimental:2.2.0b1`

### vLLM start command is
model="/llm/models/Qwen2-72B-Instruct/"
served_model_name="Qwen2-72B-Instruct"
source /opt/intel/1ccl-wks/setvars.sh
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=2
python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
  --served-model-name $served_model_name \
  --port 8000 \
  --model $model...
**The vLLM docker image is** `intelanalytics/ipex-llm-serving-xpu-vllm-0.5.4-experimental:2.2.0b1`

**vLLM start command is**
model="/llm/models/meta-llama/LLaMA-33B-HF/"
served_model_name="LLaMA-33B-HF"
source /opt/intel/1ccl-wks/setvars.sh
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=2
python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
  --served-model-name $served_model_name \
  --port 8000 \
  --model $model \
  --trust-remote-code...
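Both reports above launch the server the same way for large models spread across multiple ARC cards. For reference, a launch of this kind usually also specifies how many cards to shard across; a minimal sketch, assuming the ipex_llm entrypoint accepts vLLM's standard `--tensor-parallel-size` and `--gpu-memory-utilization` options (the values and paths below are illustrative, not the reporter's actual settings):

```
#!/bin/bash
# Hypothetical multi-card launch sketch; TP size, memory utilization and the
# model path are assumptions for illustration.
model="/llm/models/meta-llama/LLaMA-33B-HF/"
served_model_name="LLaMA-33B-HF"

source /opt/intel/1ccl-wks/setvars.sh
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=2

python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
  --served-model-name "$served_model_name" \
  --model "$model" \
  --port 8000 \
  --trust-remote-code \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.9
```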