[Bug] MMLU evaluation on A100 hangs: the run log stalls on the last prompt of a batch and never completes
Prerequisite
- [x] I have searched Issues and Discussions but cannot get the expected help.
- [x] The bug has not been fixed in the latest version.
Type
I'm evaluating with the officially supported tasks/models/datasets.
Environment
Hardware: 4× A100 GPUs (CUDA_VISIBLE_DEVICES="0,1,2,3"); inference backend: vLLM, launched with VLLM_WORKER_MULTIPROC_METHOD=spawn. The run command and the full model configuration are given in the reproduction sections below.
Reproduces the problem - code/configuration sample
from opencompass.models import HuggingFaceBaseModel
from opencompass.models import VLLMwithChatTemplate
import torch

# Earlier variant (iter900), kept commented out; the active config is the one below.
# models = [
#     dict(
#         type=VLLMwithChatTemplate,
#         abbr='deepseek-v2-lite-hf',
#         path='/mnt/seed-program-nas/001688/zn/output_tulu3/checkpoint/finetune-mcore-deepseek-v2-A2.4B-lr-5e-6-minlr-1e-6-bs-2-gbs-1024-seqlen-4096-pr-bf16-tp-1-pp-4-cp-1-ac-sel-do-true-sp-true-ti-10000-wi-100/Deepseek-V2-Lite-tp1-pp4-ep2-iter900',
#         max_out_len=1024,
#         model_kwargs=dict(gpu_memory_utilization=0.6),
#         batch_size=16,
#         run_cfg=dict(num_gpus=4),
#     )
# ]

ckpt_dir = '/mnt/seed-program-nas/001688/zn/output_tulu3/checkpoint/finetune-mcore-deepseek-v2-A2.4B-lr-5e-6-minlr-1e-6-bs-2-gbs-1024-seqlen-4096-pr-bf16-tp-1-pp-4-cp-1-ac-sel-do-true-sp-true-ti-10000-wi-100/'
iter = 1800
gpus = 4

models = [
    dict(
        type=VLLMwithChatTemplate,
        abbr=f'Deepseek-V2-Lite-tp1-pp4-ep2-iter{iter}',
        path=f'{ckpt_dir}/Deepseek-V2-Lite-tp1-pp4-ep2-iter{iter}',
        model_kwargs=dict(
            tensor_parallel_size=gpus,
            gpu_memory_utilization=0.9),
        max_seq_len=4096,
        batch_size=32,
        generation_kwargs=dict(temperature=0),
        run_cfg=dict(num_gpus=gpus, num_procs=1),
    )
]
Reproduces the problem - command or script
CUDA_VISIBLE_DEVICES="0,1,2,3" VLLM_WORKER_MULTIPROC_METHOD=spawn python run.py --models hf_deepseek_v2_lite_900 --datasets mmlu_openai_simple_evals_gen_b618ea
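If helpful for triage, the same command can also be run in OpenCompass's debug mode (assuming the installed version supports run.py's --debug flag), which runs tasks sequentially in the current process and prints worker logs directly, making it easier to localize where the hang occurs:

CUDA_VISIBLE_DEVICES="0,1,2,3" VLLM_WORKER_MULTIPROC_METHOD=spawn python run.py --models hf_deepseek_v2_lite_900 --datasets mmlu_openai_simple_evals_gen_b618ea --debug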
Reproduces the problem - error message
Processed prompts:  56%|█████▋    | 18/32 [00:00<00:00, 73.75it/s, est. speed input: 9841.84 toks/s, output: 444.10 toks/s]
Processed prompts:  81%|████████▏ | 26/32 [00:00<00:00, 71.30it/s, est. speed input: 10341.34 toks/s, output: 725.32 toks/s]
Processed prompts:  97%|█████████▋| 31/32 [00:19<00:00, 71.30it/s, est. speed input: 3938.93 toks/s, output: 411.35 toks/s]

At 31/32 the last prompt of the batch never finishes; the process hangs here and makes no further progress.
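To check whether the hang comes from vLLM itself rather than the OpenCompass harness, a minimal standalone sketch along these lines could replay a batch with the same engine settings. The checkpoint path and prompts below are placeholders, not the exact failing inputs; LLM and SamplingParams are vLLM's public API:

from vllm import LLM, SamplingParams

# Placeholder: the same converted checkpoint as in the config above.
CKPT = '/mnt/seed-program-nas/001688/zn/.../Deepseek-V2-Lite-tp1-pp4-ep2-iter1800'

llm = LLM(
    model=CKPT,
    tensor_parallel_size=4,      # matches tensor_parallel_size=gpus above
    gpu_memory_utilization=0.9,  # matches model_kwargs above
    max_model_len=4096,          # matches max_seq_len=4096 above
)

# Greedy decoding, matching generation_kwargs=dict(temperature=0).
params = SamplingParams(temperature=0, max_tokens=1024)

# Placeholder prompts; a faithful repro would replay the same 32-prompt MMLU batch.
prompts = ['dummy prompt'] * 32
outputs = llm.generate(prompts, params)
print(f'{len(outputs)}/{len(prompts)} prompts completed')

If this standalone run completes, the problem is more likely in the harness or in inter-process communication (note the spawn multiprocessing method) than in the engine itself.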
Other information
No response