[Bug] MMLU evaluation on A100 hangs: the run log shows it stalls on the last prompt of a batch and never recovers

Open yanchenmochen opened this issue 2 months ago • 1 comment

Prerequisite

[x] I have searched Issues and Discussions but cannot get the expected help.
[x] The bug has not been fixed in the latest version.

Type

I'm evaluating with the officially supported tasks/models/datasets.

Environment

CUDA_VISIBLE_DEVICES="0,1,2,3" VLLM_WORKER_MULTIPROC_METHOD=spawn python run.py --models hf_deepseek_v2_lite_900 --datasets mmlu_openai_simple_evals_gen_b618ea

from opencompass.models import HuggingFaceBaseModel
from opencompass.models import VLLMwithChatTemplate
import torch

# models = [
#     dict(
#         type=VLLMwithChatTemplate,
#         abbr='deepseek-v2-lite-hf',
#         path='/mnt/seed-program-nas/001688/zn/output_tulu3/checkpoint/finetune-mcore-deepseek-v2-A2.4B-lr-5e-6-minlr-1e-6-bs-2-gbs-1024-seqlen-4096-pr-bf16-tp-1-pp-4-cp-1-ac-sel-do-true-sp-true-ti-10000-wi-100/Deepseek-V2-Lite-tp1-pp4-ep2-iter900',
#         max_out_len=1024,
#         model_kwargs=dict(gpu_memory_utilization=0.6),
#         batch_size=16,
#         run_cfg=dict(num_gpus=4),
#     )
# ]

ckpt_dir='/mnt/seed-program-nas/001688/zn/output_tulu3/checkpoint/finetune-mcore-deepseek-v2-A2.4B-lr-5e-6-minlr-1e-6-bs-2-gbs-1024-seqlen-4096-pr-bf16-tp-1-pp-4-cp-1-ac-sel-do-true-sp-true-ti-10000-wi-100/'
iter=1800
gpus=4
models = [
    dict(
        type=VLLMwithChatTemplate,
        abbr=f'Deepseek-V2-Lite-tp1-pp4-ep2-iter{iter}',
        path=f'{ckpt_dir}/Deepseek-V2-Lite-tp1-pp4-ep2-iter{iter}',
        model_kwargs=dict(
            tensor_parallel_size=gpus,
            gpu_memory_utilization=0.9),
        max_seq_len=4096,
        batch_size=32,
        generation_kwargs=dict(temperature=0),
        run_cfg=dict(num_gpus=gpus, num_procs=1),
    )
]

Reproduces the problem - code/configuration sample

from opencompass.models import HuggingFaceBaseModel
from opencompass.models import VLLMwithChatTemplate
import torch

# models = [
#     dict(
#         type=VLLMwithChatTemplate,
#         abbr='deepseek-v2-lite-hf',
#         path='/mnt/seed-program-nas/001688/zn/output_tulu3/checkpoint/finetune-mcore-deepseek-v2-A2.4B-lr-5e-6-minlr-1e-6-bs-2-gbs-1024-seqlen-4096-pr-bf16-tp-1-pp-4-cp-1-ac-sel-do-true-sp-true-ti-10000-wi-100/Deepseek-V2-Lite-tp1-pp4-ep2-iter900',
#         max_out_len=1024,
#         model_kwargs=dict(gpu_memory_utilization=0.6),
#         batch_size=16,
#         run_cfg=dict(num_gpus=4),
#     )
# ]

ckpt_dir='/mnt/seed-program-nas/001688/zn/output_tulu3/checkpoint/finetune-mcore-deepseek-v2-A2.4B-lr-5e-6-minlr-1e-6-bs-2-gbs-1024-seqlen-4096-pr-bf16-tp-1-pp-4-cp-1-ac-sel-do-true-sp-true-ti-10000-wi-100/'
iter=1800
gpus=4
models = [
    dict(
        type=VLLMwithChatTemplate,
        abbr=f'Deepseek-V2-Lite-tp1-pp4-ep2-iter{iter}',
        path=f'{ckpt_dir}/Deepseek-V2-Lite-tp1-pp4-ep2-iter{iter}',
        model_kwargs=dict(
            tensor_parallel_size=gpus,
            gpu_memory_utilization=0.9),
        max_seq_len=4096,
        batch_size=32,
        generation_kwargs=dict(temperature=0),
        run_cfg=dict(num_gpus=gpus, num_procs=1),
    )
]

Reproduces the problem - command or script

CUDA_VISIBLE_DEVICES="0,1,2,3" VLLM_WORKER_MULTIPROC_METHOD=spawn python run.py --models hf_deepseek_v2_lite_900 --datasets mmlu_openai_simple_evals_gen_b618ea

Reproduces the problem - error message

Processed prompts: 56%|█████▋ | 18/32 [00:00<00:00, 73.75it/s, est. speed input: 9841.84 toks/s, output: 444.10 toks/s]

Processed prompts: 81%|████████▏ | 26/32 [00:00<00:00, 71.30it/s, est. speed input: 10341.34 toks/s, output: 725.32 toks/s]

Processed prompts: 97%|█████████▋| 31/32 [00:19<00:00, 71.30it/s, est. speed input: 3938.93 toks/s, output: 411.35 toks/s]

At this point the run hangs and makes no further progress.
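The progress bar alone does not show where the process is stuck at 31/32. One possible way to collect more detail is a watchdog built on Python's standard `faulthandler` module, which periodically writes the stack of every thread to a file. This is only a sketch: the helper names `arm_hang_watchdog` and `disarm_hang_watchdog` and the log path are hypothetical, and nothing here is part of OpenCompass or vLLM.

```python
import faulthandler


def arm_hang_watchdog(timeout_s, log_path):
    """Dump all thread stacks to log_path every timeout_s seconds.

    Hypothetical helper, not an OpenCompass or vLLM API.
    The returned file object must stay open while the watchdog is armed,
    because faulthandler writes to its file descriptor directly.
    """
    log_file = open(log_path, "w")
    # repeat=True keeps dumping at each interval, so a long hang
    # produces several snapshots that can be compared.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=log_file)
    return log_file


def disarm_hang_watchdog(log_file):
    """Cancel the pending dump and close the log file."""
    faulthandler.cancel_dump_traceback_later()
    log_file.close()
```

Calling something like `arm_hang_watchdog(60, "hang_traceback.log")` early in the process that drives inference would be one option. Note the limitation: the dump covers only the Python process in which it is armed, so worker processes started with `VLLM_WORKER_MULTIPROC_METHOD=spawn` would not appear unless the watchdog is armed inside them as well.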

Other information

No response

Sep 18 '25 12:09 yanchenmochen
