opencompass
[Bug] Qwen3-8B scores only 34 on gsm8k, far below the 89.4 reported in the official technical report
Prerequisite
- [x] I have searched Issues and Discussions but cannot get the expected help.
- [x] The bug has not been fixed in the latest version.
Type
I'm evaluating with the officially supported tasks/models/datasets.
Environment
{'CUDA available': True, 'CUDA_HOME': '/usr/local/cuda-12.8', 'GCC': 'gcc (Ubuntu 11.4.0-1ubuntu1~22.04.2) 11.4.0', 'GPU 0': 'NVIDIA GeForce RTX 5090 Laptop GPU', 'MMEngine': '0.10.7', 'MUSA available': False, 'NVCC': 'Cuda compilation tools, release 12.8, V12.8.93', 'OpenCV': '4.11.0', 'PyTorch': '2.7.0+cu128', 'PyTorch compiling details': 'PyTorch built with:\n' ' - GCC 11.2\n' ' - C++ Version: 201703\n' ' - Intel(R) oneAPI Math Kernel Library Version ' '2024.2-Product Build 20240605 for Intel(R) 64 ' 'architecture applications\n' ' - Intel(R) MKL-DNN v3.7.1 (Git Hash ' '8d263e693366ef8db40acc569cc7d8edf644556d)\n' ' - OpenMP 201511 (a.k.a. OpenMP 4.5)\n' ' - LAPACK is enabled (usually provided by ' 'MKL)\n' ' - NNPACK is enabled\n' ' - CPU capability usage: AVX2\n' ' - CUDA Runtime 12.8\n' ' - NVCC architecture flags: ' '-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90;-gencode;arch=compute_100,code=sm_100;-gencode;arch=compute_120,code=sm_120;-gencode;arch=compute_120,code=compute_120\n' ' - CuDNN 90.7.1\n' ' - Magma 2.6.1\n' ' - Build settings: BLAS_INFO=mkl, ' 'BUILD_TYPE=Release, ' 'COMMIT_SHA=134179474539648ba7dee1317959529fbd0e7f89, ' 'CUDA_VERSION=12.8, CUDNN_VERSION=9.7.1, ' 'CXX_COMPILER=/opt/rh/gcc-toolset-11/root/usr/bin/c++, ' 'CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=1 ' '-fvisibility-inlines-hidden -DUSE_PTHREADPOOL ' '-DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER ' '-DLIBKINETO_NOXPUPTI=ON -DUSE_FBGEMM ' '-DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK ' '-DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC ' '-Wall -Wextra -Werror=return-type ' '-Werror=non-virtual-dtor ' '-Werror=range-loop-construct ' '-Werror=bool-operation -Wnarrowing ' '-Wno-missing-field-initializers ' '-Wno-unknown-pragmas -Wno-unused-parameter ' '-Wno-strict-overflow -Wno-strict-aliasing ' '-Wno-stringop-overflow -Wsuggest-override ' '-Wno-psabi -Wno-error=old-style-cast ' '-fdiagnostics-color=always 
-faligned-new ' '-Wno-maybe-uninitialized -fno-math-errno ' '-fno-trapping-math -Werror=format ' '-Wno-stringop-overflow, LAPACK_INFO=mkl, ' 'PERF_WITH_AVX=1, PERF_WITH_AVX2=1, ' 'TORCH_VERSION=2.7.0, USE_CUDA=ON, USE_CUDNN=ON, ' 'USE_CUSPARSELT=1, USE_GFLAGS=OFF, USE_GLOG=OFF, ' 'USE_GLOO=ON, USE_MKL=ON, USE_MKLDNN=ON, ' 'USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, ' 'USE_OPENMP=ON, USE_ROCM=OFF, ' 'USE_ROCM_KERNEL_ASSERT=OFF, \n', 'Python': '3.10.18 | packaged by conda-forge | (main, Jun 4 2025, 14:45:41) ' '[GCC 13.3.0]', 'TorchVision': '0.22.0+cu128', 'lmdeploy': "not installed:No module named 'lmdeploy'", 'numpy_random_seed': 2147483648, 'opencompass': '0.4.2+f30aea8', 'sys.platform': 'linux', 'transformers': '4.53.2'}
Reproduces the problem - code/configuration sample
The code in my opencompass/configs/models/vllm_qwen3_8b.py is as follows:
opencompass/configs/models/qwen3_8b_local.py
from opencompass.models import VLLMwithChatTemplate
from opencompass.utils.text_postprocessors import extract_non_reasoning_content

models = [
    dict(
        type=VLLMwithChatTemplate,
        abbr='Qwen3-8B',
        path='/mnt/d/model/qwen/qwen3-8B/',
        model_kwargs=dict(tensor_parallel_size=1, max_model_len=16384),
        max_out_len=10240,
        batch_size=8,
        generation_kwargs=dict(
            do_sample=True,
            temperature=0.6,
            top_p=0.95,
            top_k=20,
            repetition_penalty=1.05,
            min_p=0,
        ),
        run_cfg=dict(num_gpus=1),
        meta_template=dict(
            round=[
                dict(role='HUMAN', api_role='HUMAN'),
                dict(role='BOT', api_role='BOT', generate=True),
            ]
        ),
        pred_postprocessor=dict(type=extract_non_reasoning_content),
    )
]
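For context, the pred_postprocessor is meant to discard the model's reasoning before answer extraction. A minimal sketch of what such a postprocessor is expected to do for Qwen3-style output, assuming the reasoning is wrapped in a `<think>...</think>` block (the function name and regex here are illustrative, not the actual OpenCompass implementation):

```python
import re

def strip_reasoning(text: str) -> str:
    """Hypothetical sketch: drop a leading <think>...</think> block,
    keeping only the final answer text, which is what a postprocessor
    like extract_non_reasoning_content should leave for scoring."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

sample = "<think>72 / 2 = 36</think>\nThe answer is 36."
print(strip_reasoning(sample))  # -> The answer is 36.
```

If the raw predictions still contain the `<think>` block, the gsm8k answer extractor may pick a number out of the reasoning instead of the final answer, which would depress the score.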
Reproduces the problem - command or script
The evaluation command is as follows:
export VLLM_USE_V1=0
export PYTHONUNBUFFERED=1
export MULTIPROCESSING_METHOD=spawn  # or try fork if spawn fails, though the problem is usually with fork
python run.py \
    --models vllm_qwen3_8b \
    --model-kwargs gpu_memory_utilization=0.9 enable-reasoning=True reasoning-parser=deepseek_r1 \
    --datasets gsm8k_gen \
    --work-dir eval_result \
    --max-seq-len 16384 \
    --debug
Reproduces the problem - error message
The evaluation results are as follows:

dataset    version    metric      mode    Qwen3-8B
gsm8k      1d7fe4     accuracy    gen     34.34

I have already followed the best-practice configuration and repeated the experiment several times with little variation. Evaluating the same model with evalscope gives 90.23. Where is the problem?
Other information
No response
@Myhs-phz Please check this issue
First check the files under your result directory to see whether the model actually reasoned correctly and whether the expected answers were extracted; then check whether the evaluation method is correct. Go through the whole pipeline to verify each step.
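The inspection suggested above can be sketched as a small script that dumps the first few raw predictions from the work dir, so you can see whether the reasoning was stripped and the final number is present in an extractable form. The `eval_result/**/predictions/**/*.json` layout and the `prediction` field name are assumptions; adjust them to match your actual OpenCompass output tree:

```python
import json
from pathlib import Path

def iter_predictions(work_dir: str, limit: int = 3):
    """Yield (file name, sample id, prediction text) for the first few
    records of each predictions JSON found under work_dir.
    The directory layout is an assumption -- adapt the glob as needed."""
    for pred_file in sorted(Path(work_dir).glob("**/predictions/**/*.json")):
        records = json.loads(pred_file.read_text())
        for key in list(records)[:limit]:
            yield pred_file.name, key, records[key].get("prediction", "")

# Print raw outputs for manual inspection:
for fname, sid, pred in iter_predictions("eval_result"):
    print(f"=== {fname} / {sid} ===")
    print(pred[:500])
```

If the printed predictions look correct but the score is low, the issue is likely in answer extraction or the postprocessor order rather than in generation.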