Mike Yang


Just profiled llama.cpp running DeepSeek R1 with Q4_K_M and found that it only uses AVX-VNNI, not the AMX instructions. ![Image](https://github.com/user-attachments/assets/f1ebd579-0d62-402e-95c4-fdcf8d17814b) ![Image](https://github.com/user-attachments/assets/381a0af0-83bc-4464-aaae-c5a7e7a6602a)
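As a quick sanity check, you can first confirm whether the host CPU even advertises AMX to the kernel, independent of what llama.cpp was compiled with. A minimal sketch reading the standard `/proc/cpuinfo` feature flags:

```python
# Minimal sketch: check whether the Linux kernel exposes the AMX feature
# flags for this CPU. These are the standard /proc/cpuinfo flag names on
# recent kernels (Sapphire Rapids and later).
AMX_FLAGS = {"amx_tile", "amx_int8", "amx_bf16"}

with open("/proc/cpuinfo") as f:
    for line in f:
        if line.startswith("flags"):
            cpu_flags = set(line.split(":", 1)[1].split())
            print("AMX flags present:", sorted(AMX_FLAGS & cpu_flags))
            break
```

If the flags are present but the profile still shows only AVX-VNNI, the likely culprit is the build configuration of the llama.cpp binary rather than the hardware.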

Besides not using the AMX instructions, the profile also shows that a lot of CPU time is spent waiting on the OpenMP GOMP_barrier, so the CPU cores are not fully utilized. I tried to change the number...
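One way to explore the thread-count question is to time a short generation run under different OpenMP settings and see where the barrier wait stops paying off. A sketch below; the binary name (`llama-cli`), model path, and flags are assumptions about the local llama.cpp build:

```python
# Sketch: time llama-cli with different thread counts to find where the
# GOMP_barrier overhead dominates. Binary path and model path are
# placeholders for the local setup.
import os
import subprocess
import time

MODEL = "deepseek-r1-q4_k_m.gguf"  # placeholder path

for threads in (8, 16, 32, 64):
    env = dict(os.environ, OMP_NUM_THREADS=str(threads))
    start = time.time()
    subprocess.run(
        ["./llama-cli", "-m", MODEL, "-t", str(threads), "-n", "64",
         "-p", "Hello"],
        env=env, check=True, capture_output=True,
    )
    print(f"{threads} threads: {time.time() - start:.1f}s")
```

On machines where hyper-threaded siblings share execution units, throughput often peaks well below the logical core count, which matches seeing cores stalled in the barrier.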

transformers 4.36.2 has another issue with the Mistral model. I have tested that transformers v4.40 works with the Mistral model.
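A small guard like the following sketch can catch the version mismatch before the model load fails; the model id is only an example, and the v4.40 floor reflects the result reported above:

```python
# Sketch: verify the installed transformers version before loading a
# Mistral checkpoint. The >= 4.40 requirement is taken from the test
# result above, not from official release notes.
import transformers
from packaging import version

assert version.parse(transformers.__version__) >= version.parse("4.40.0"), (
    f"transformers {transformers.__version__} hit issues with Mistral here; "
    "upgrade to >= 4.40"
)

from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
```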

With the latest IPEX-LLM, the following error occurs during inference: INFO 08-16 10:12:59 metrics.py:217] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0...

This doesn't fix the issue. In the benchmark code, deepspeed.init_inference is a synchronizing call. If you add a sleep before this call, all the processes will still be synced right after it....
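The point about sleeps being undone by a synchronizing call can be demonstrated with a plain torch.distributed barrier, which plays the same role here that deepspeed.init_inference plays in the benchmark. A self-contained sketch:

```python
# Sketch: each rank sleeps a different amount, then hits a barrier (the
# role deepspeed.init_inference plays in the benchmark). After the
# barrier all ranks proceed together, so the earlier sleep changes nothing.
import time
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int) -> None:
    dist.init_process_group("gloo", init_method="tcp://127.0.0.1:29500",
                            rank=rank, world_size=world_size)
    time.sleep(rank * 2)   # stagger the ranks on purpose
    dist.barrier()         # ...and the barrier re-syncs them anyway
    print(f"rank {rank} passed the barrier at {time.time():.2f}")
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```

Both ranks print nearly identical timestamps, regardless of how long each slept beforehand.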

The Linux kernel has some issues with swap. I am using Ubuntu 22.04 with both the 6.5 and 6.8 kernels. If we use a 512 GB swap disk file, the Linux...

After searching the DeepSpeed documentation, there are some solutions to reduce host CPU memory: https://joe-cecil.com/using-meta-tensors-to-load-models-that-dont-fit-in-memory/ ("Using meta tensors to load models that don't fit in memory"). PyTorch recently implemented...
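Following the approach in the linked article, here is a minimal sketch of the meta-tensor technique using plain PyTorch; the model id is an example, and attaching real weight storage afterwards (e.g. via DeepSpeed's checkpoint loading) is elided:

```python
# Sketch: instantiate the model skeleton on the "meta" device so no host
# RAM is allocated for weights. Real storage is attached later when the
# checkpoint is loaded, e.g. by DeepSpeed's checkpoint injection.
import torch
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("mistralai/Mistral-7B-v0.1")  # example id
with torch.device("meta"):
    model = AutoModelForCausalLM.from_config(config)

p = next(model.parameters())
print(p.device, p.shape)  # "meta" and the full shape, but no data behind it
```

Because meta tensors carry only shape and dtype, the construction step costs almost no host memory even for models far larger than RAM.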

After modifying /usr/local/lib/python3.11/dist-packages/vllm-0.5.4+xpu-py3.11-linux-x86_64.egg/vllm/worker/xpu_worker.py with `get_pp_group().all_gather(torch.zeros(1).xpu())`, vLLM starts with the following error: 2024:09:14-11:02:35:( 241) |CCL_WARN| topology recognition shows PCIe connection between devices. If this is not correct,...
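For context, a hedged sketch of how that one-line change might sit inside xpu_worker.py; only the quoted all_gather line comes from the actual edit, while the function name and placement are illustrative assumptions:

```python
# Illustrative sketch of the xpu_worker.py edit described above. Only the
# all_gather line is the actual change; the wrapper function is an
# assumption. Requires intel_extension_for_pytorch so .xpu() tensors exist.
import torch
from vllm.distributed import get_pp_group

def _warm_up_pp_group() -> None:
    # A dummy all_gather on a 1-element XPU tensor forces the pipeline-
    # parallel communicator (oneCCL) to initialize before real traffic.
    get_pp_group().all_gather(torch.zeros(1).xpu())
```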