Mike Yang


Just profiled llama.cpp running DeepSeek R1 with Q4_K_M and found that it only uses AVX-VNNI, not the AMX instructions. ![Image](https://github.com/user-attachments/assets/f1ebd579-0d62-402e-95c4-fdcf8d17814b) ![Image](https://github.com/user-attachments/assets/381a0af0-83bc-4464-aaae-c5a7e7a6602a)
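As a quick sanity check, you can first confirm whether the host CPU even advertises AMX to the kernel, independent of what llama.cpp was compiled with. A minimal sketch reading the standard `/proc/cpuinfo` feature flags:

```python
# Minimal sketch: check whether the Linux kernel exposes the AMX feature
# flags for this CPU. These are the standard /proc/cpuinfo flag names on
# recent kernels (Sapphire Rapids and later).
AMX_FLAGS = {"amx_tile", "amx_int8", "amx_bf16"}

with open("/proc/cpuinfo") as f:
    for line in f:
        if line.startswith("flags"):
            cpu_flags = set(line.split(":", 1)[1].split())
            print("AMX flags present:", sorted(AMX_FLAGS & cpu_flags))
            break
```

If the flags are present but the profile still shows only AVX-VNNI, the likely culprit is the build configuration of the llama.cpp binary rather than the hardware.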

Besides not using the AMX instructions, the profile also shows that a lot of CPU time is spent waiting on the OpenMP GOMP_barrier, so the CPU cores are not fully utilized. I tried to change the number...
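One way to explore the thread-count question is to time a short generation run under different OpenMP settings and see where the barrier wait stops paying off. A sketch below; the binary name (`llama-cli`), model path, and flags are assumptions about the local llama.cpp build:

```python
# Sketch: time llama-cli with different thread counts to find where the
# GOMP_barrier overhead dominates. Binary path and model path are
# placeholders for the local setup.
import os
import subprocess
import time

MODEL = "deepseek-r1-q4_k_m.gguf"  # placeholder path

for threads in (8, 16, 32, 64):
    env = dict(os.environ, OMP_NUM_THREADS=str(threads))
    start = time.time()
    subprocess.run(
        ["./llama-cli", "-m", MODEL, "-t", str(threads), "-n", "64",
         "-p", "Hello"],
        env=env, check=True, capture_output=True,
    )
    print(f"{threads} threads: {time.time() - start:.1f}s")
```

On machines where hyper-threaded siblings share execution units, throughput often peaks well below the logical core count, which matches seeing cores stalled in the barrier.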

transformers 4.36.2 has another issue with the Mistral model. I have tested that transformers v4.40 works with the Mistral model.
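A small guard like the following sketch can catch the version mismatch before the model load fails; the model id is only an example, and the v4.40 floor reflects the result reported above:

```python
# Sketch: verify the installed transformers version before loading a
# Mistral checkpoint. The >= 4.40 requirement is taken from the test
# result above, not from official release notes.
import transformers
from packaging import version

assert version.parse(transformers.__version__) >= version.parse("4.40.0"), (
    f"transformers {transformers.__version__} hit issues with Mistral here; "
    "upgrade to >= 4.40"
)

from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
```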

With the latest IPEX-LLM, the following error occurs during inference: INFO 08-16 10:12:59 metrics.py:217] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0...

This doesn't fix the issue. In the benchmark code, deepspeed.init_inference is a synchronizing call. If you add a sleep before this call, all the processes will still be synced right after it....
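The point about sleeps being undone by a synchronizing call can be demonstrated with a plain torch.distributed barrier, which plays the same role here that deepspeed.init_inference plays in the benchmark. A self-contained sketch:

```python
# Sketch: each rank sleeps a different amount, then hits a barrier (the
# role deepspeed.init_inference plays in the benchmark). After the
# barrier all ranks proceed together, so the earlier sleep changes nothing.
import time
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int) -> None:
    dist.init_process_group("gloo", init_method="tcp://127.0.0.1:29500",
                            rank=rank, world_size=world_size)
    time.sleep(rank * 2)   # stagger the ranks on purpose
    dist.barrier()         # ...and the barrier re-syncs them anyway
    print(f"rank {rank} passed the barrier at {time.time():.2f}")
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```

Both ranks print nearly identical timestamps, regardless of how long each slept beforehand.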

The Linux kernel has some issues with swap. I am using Ubuntu 22.04 with both the 6.5 and 6.8 kernels. If we use a 512 GB swap disk file, the Linux...

After searching the DeepSpeed documentation, there are some solutions to reduce host CPU memory: https://joe-cecil.com/using-meta-tensors-to-load-models-that-dont-fit-in-memory/ ("Using meta tensors to load models that don't fit in memory"). PyTorch recently implemented...
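Following the approach in the linked article, here is a minimal sketch of the meta-tensor technique using plain PyTorch; the model id is an example, and attaching real weight storage afterwards (e.g. via DeepSpeed's checkpoint loading) is elided:

```python
# Sketch: instantiate the model skeleton on the "meta" device so no host
# RAM is allocated for weights. Real storage is attached later when the
# checkpoint is loaded, e.g. by DeepSpeed's checkpoint injection.
import torch
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("mistralai/Mistral-7B-v0.1")  # example id
with torch.device("meta"):
    model = AutoModelForCausalLM.from_config(config)

p = next(model.parameters())
print(p.device, p.shape)  # "meta" and the full shape, but no data behind it
```

Because meta tensors carry only shape and dtype, the construction step costs almost no host memory even for models far larger than RAM.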

After modifying /usr/local/lib/python3.11/dist-packages/vllm-0.5.4+xpu-py3.11-linux-x86_64.egg/vllm/worker/xpu_worker.py with `get_pp_group().all_gather(torch.zeros(1).xpu())`, vLLM starts with the following error: 2024:09:14-11:02:35:( 241) |CCL_WARN| topology recognition shows PCIe connection between devices. If this is not correct,...
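For context, a hedged sketch of how that one-line change might sit inside xpu_worker.py; only the quoted all_gather line comes from the actual edit, while the function name and placement are illustrative assumptions:

```python
# Illustrative sketch of the xpu_worker.py edit described above. Only the
# all_gather line is the actual change; the wrapper function is an
# assumption. Requires intel_extension_for_pytorch so .xpu() tensors exist.
import torch
from vllm.distributed import get_pp_group

def _warm_up_pp_group() -> None:
    # A dummy all_gather on a 1-element XPU tensor forces the pipeline-
    # parallel communicator (oneCCL) to initialize before real traffic.
    get_pp_group().all_gather(torch.zeros(1).xpu())
```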