
Testing the DS-14B model on the AIME24 task hangs and the test cannot complete

Open shawn9977 opened this issue 6 months ago • 5 comments

Image: intelanalytics/ipex-llm-serving-xpu:0.8.3-b19 or intelanalytics/ipex-llm-serving-xpu:0.8.3-b21. Model: DeepSeek-R1-Distill-Qwen-14B (FP16). Tool: Lighteval. Dataset: AIME24.

The test could not be completed. I ran it three times: once it reached 73% and stopped responding, once 75%, and once 77%.


shawn9977 avatar Jun 23 '25 10:06 shawn9977

Could you provide more details on how to reproduce it? Please share the bash script used to start the vLLM service and the commands used to run the AIME24 case with Lighteval.

hzjane avatar Jul 07 '25 02:07 hzjane

Running Model: DeepSeek-R1-Distill-Qwen-32B, precision: INT4, FP8, or FP16, Task: AIME24

hits the same problem.

shawn9977 avatar Jul 13 '25 01:07 shawn9977

Step1:
docker run --rm -dit \
  --privileged \
  --net=host \
  --device=/dev/dri \
  --name=lighteval-b21 \
  -v /home/shawn:/llm/shawn \
  -e no_proxy=localhost,127.0.0.1 \
  -e http_proxy=$http_proxy \
  -e https_proxy=$http_proxy \
  --shm-size="32g" \
  --entrypoint /bin/bash \
  intelanalytics/ipex-llm-serving-xpu:0.8.3-b21

Step2: docker exec -it lighteval-b21 bash
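
Once inside the container, a quick sanity check that the GPUs are actually visible (a minimal sketch; sycl-ls is assumed to be available in this image):

ls /dev/dri        # the render nodes passed through with --device=/dev/dri should show up here
sycl-ls            # lists the Level Zero / OpenCL devices that SYCL can see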

Step3:
git clone https://github.com/huggingface/lighteval.git
cd lighteval

Update src/lighteval/models/vllm/vllm_model.py following this diff: https://github.com/huggingface/lighteval/compare/main...liu-shaojun:lighteval:ipex-llm?expand=1

pip install -e .
pip install latex2sympy2_extended==1.0.6

export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli login

Step4:
export USE_XETLA=OFF
export SYCL_CACHE_PERSISTENT=1
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=2
export FI_PROVIDER=shm
export TORCH_LLM_ALLREDUCE=0
export CCL_WORKER_COUNT=1
export CCL_ATL_TRANSPORT=ofi
export CCL_ZE_IPC_EXCHANGE=sockets
export CCL_ATL_SHM=1
export CCL_SAME_STREAM=1
export CCL_BLOCKING_WAIT=0
export LOAD_IN_LOW_BIT="fp8"
export IPEX_LLM_LOWBIT=$LOAD_IN_LOW_BIT
export VLLM_USE_V1=0
source /opt/intel/1ccl-wks/setvars.sh

Step5: lighteval vllm "vllm_model_config.yaml" "lighteval|gpqa:diamond|0|0" 

====================================================
vllm_model_config.yaml

model_parameters:
  model_name: "/llm/shawn/models/DeepSeek-R1-Distill-Qwen-32B"
  revision: "main"
  distributed_executor_backend: "ray"
  dtype: "float16"
  tensor_parallel_size: 8
  data_parallel_size: 1
  pipeline_parallel_size: 1
  gpu_memory_utilization: 0.90
  max_model_length: 35585
  swap_space: 4
  seed: 42
  trust_remote_code: True
  use_chat_template: True
  add_special_tokens: True
  multichoice_continuations_start_space: False
  pairwise_tokenization: False
  subfolder: null
  max_num_seqs: 8
  max_num_batched_tokens: 35585
  generation_parameters:
    temperature: 0.6
    top_p: 0.95
    seed: 42
    max_new_tokens: 32768
metrics_options:
  yo: null
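
To double-check that the config file parses as valid YAML before launching, one option (a minimal sketch; assumes PyYAML is available in the container, which it normally is as a vLLM dependency):

python -c "import yaml; print(yaml.safe_load(open('vllm_model_config.yaml')))"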

shawn9977 avatar Jul 21 '25 04:07 shawn9977

export USE_XETLA=OFF
export SYCL_CACHE_PERSISTENT=1
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=2
export FI_PROVIDER=shm
export TORCH_LLM_ALLREDUCE=0
export CCL_WORKER_COUNT=1
export CCL_ATL_TRANSPORT=ofi
export CCL_ZE_IPC_EXCHANGE=sockets
export CCL_ATL_SHM=1
export CCL_SAME_STREAM=1
export CCL_BLOCKING_WAIT=0
export LOAD_IN_LOW_BIT="fp8"
export IPEX_LLM_LOWBIT=$LOAD_IN_LOW_BIT
export VLLM_USE_V1=0
source /opt/intel/1ccl-wks/setvars.sh

I followed these steps and tried it. It ran for 9 hours overnight and was still running normally. What does the log look like when it gets stuck? Is it just running slowly rather than actually hanging? You could use xpu-smi dump -m 0,1,2,3,18 to monitor whether the XPUs are working properly.
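
For example, to keep a rolling log of those metrics in the background while the evaluation runs (a minimal sketch; the -d device index, -i polling interval, and output path are assumptions to adjust for your setup):

xpu-smi dump -d 0 -m 0,1,2,3,18 -i 5 > xpu_metrics_gpu0.log 2>&1 &   # poll GPU 0 every 5 s and log to a file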

Processed prompts: 74%|█████████▋ | 147/198 [8:51:39<3:04:27, 217.00s/it, est. speed input: 1.40 toks/s, output: 76.83 toks/s]

hzjane avatar Jul 22 '25 01:07 hzjane

hi @shawn9977 I have replied by email.

When you reproduce the issue, we suggest using the py-spy tool to see exactly which step it is stuck on. The procedure is as follows:

  1. Install py-spy (if not already installed): pip install py-spy
  2. Find the PID of the main lighteval process.
  3. Run the following command to inspect the current stack of the main thread: py-spy dump --pid <PID>. This shows exactly where the main thread is stuck, which helps narrow down the problem (see the sketch after this list).
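
A minimal sketch of steps 2–3 combined (the pgrep pattern is an assumption; adjust it to match your actual lighteval command line):

pip install py-spy                                  # step 1, if not already installed
PID=$(pgrep -f "lighteval vllm" | head -n 1)        # step 2: PID of the main lighteval process
py-spy dump --pid "$PID"                            # step 3: print the current stack of each thread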

liu-shaojun avatar Jul 23 '25 02:07 liu-shaojun