DeepSeek-R1-Distill-Qwen-32B (FP8 or FP16) hits OOM when running the MMLU task with Harness

Open shawn9977 opened this issue 6 months ago • 1 comments

Image: intelanalytics/ipex-llm-serving-xpu:0.8.3-b21
Model: DeepSeek-R1-Distill-Qwen-32B
Precision: FP8 or FP16
Tool: lm-evaluation-harness
Dataset: MMLU

Problem: when using Harness to evaluate the accuracy of the DS-32B INT4 model, the run does not complete; after running for a while it reports OOM.

Jul 13 '25 01:07 shawn9977

Hello. Regarding the OOM you hit when evaluating the DeepSeek-R1-Distill-Qwen-32B FP8 model on the MMLU dataset with lm-evaluation-harness, here are some explanations and suggestions:

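GPU memory can be roughly divided into the following regions:

A. Model weights region: holds the loaded model weights.
B. Profile reserved region: used for profiling during initialization.
C. KV cache region: stores the KV cache for tokens during inference. Its size is remaining memory × gpu_memory_utilization.
D. Intermediate region: holds intermediate results such as the input context, activations, and outputs. Its memory comes from (remaining memory × (1 - gpu_memory_utilization)) + B.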
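
The split described above can be sketched as simple arithmetic. This is only an illustration of the A/B/C/D description in this reply, not the actual allocator in the serving stack. It assumes that "remaining memory" means the total minus the weights (A) and the profile reservation (B); the function name `split_gpu_memory` and all sizes are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MemorySplit:
    weights_gb: float       # region A
    kv_cache_gb: float      # region C
    intermediate_gb: float  # region D (absorbs region B after profiling)


def split_gpu_memory(total_gb: float, weights_gb: float, profile_gb: float,
                     gpu_memory_utilization: float) -> MemorySplit:
    """Approximate per-GPU split following the A/B/C/D description above.

    Assumption: remaining = total - weights (A) - profile reservation (B).
    """
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    remaining_gb = total_gb - weights_gb - profile_gb
    if remaining_gb <= 0:
        raise ValueError("weights and profile reservation leave no memory")
    # C: KV cache gets a fixed fraction of what is left.
    kv_cache_gb = remaining_gb * gpu_memory_utilization
    # D: the rest of the remaining memory, plus the profile reservation B.
    intermediate_gb = remaining_gb * (1.0 - gpu_memory_utilization) + profile_gb
    return MemorySplit(weights_gb, kv_cache_gb, intermediate_gb)


if __name__ == "__main__":
    # Hypothetical sizes, chosen only to show the trade-off.
    for util in (0.5, 0.7, 0.9):
        split = split_gpu_memory(total_gb=24.0, weights_gb=10.0,
                                 profile_gb=2.0, gpu_memory_utilization=util)
        print(f"util={util}: KV cache {split.kv_cache_gb:.1f} GB, "
              f"intermediate {split.intermediate_gb:.1f} GB")
```

Under this model, raising gpu_memory_utilization grows region C at the direct expense of region D, which is why a value that is too high can starve the intermediate results and a value that is too low can leave no room for cache blocks.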

When evaluating MMLU, the dataset context is by default loaded in one place on rank 0, so it takes up additional memory in region D.

Our earlier run of the 32B INT4 model succeeded because we manually adjusted the gpu_memory_utilization parameter so that GPU memory was divided more sensibly.

Symptom: the model loads normally, but an error is raised while initializing the KV cache:

ValueError: No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine.

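In our test environment, with 4 GPUs and gpu_memory_utilization=0.7, we were able to complete the FP8 evaluation of the 32B model. We suggest starting from this value, watching how GPU memory is allocated, and fine-tuning the parameter until you find the best balance.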
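
As a rough sketch of how such a setting might be passed to the harness, the snippet below assembles an lm_eval command line for the stock vllm backend of lm-evaluation-harness, with tensor_parallel_size=4 and gpu_memory_utilization=0.7. The model path and the helper name `build_lm_eval_command` are hypothetical, and the ipex-llm serving image may need a different backend or extra arguments (for example, to select FP8), which are not shown here.

```python
import shlex


def build_lm_eval_command(model_path: str,
                          tensor_parallel_size: int = 4,
                          gpu_memory_utilization: float = 0.7,
                          tasks: str = "mmlu") -> list[str]:
    """Assemble an lm_eval invocation for the vllm backend (assumed setup)."""
    # lm_eval takes backend options as one comma-separated key=value string.
    model_args = ",".join([
        f"pretrained={model_path}",
        f"tensor_parallel_size={tensor_parallel_size}",
        f"gpu_memory_utilization={gpu_memory_utilization}",
    ])
    return [
        "lm_eval",
        "--model", "vllm",
        "--model_args", model_args,
        "--tasks", tasks,
        "--batch_size", "auto",
    ]


if __name__ == "__main__":
    # Hypothetical local path; replace with wherever the model is stored.
    cmd = build_lm_eval_command("/llm/models/DeepSeek-R1-Distill-Qwen-32B")
    print(shlex.join(cmd))
```

Building the command as a list keeps the single parameter under test easy to vary between runs while everything else stays fixed.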

Jul 14 '25 06:07 liu-shaojun