
[Bug] MBPP score significantly lower than official results

Open · GenerallyCovetous opened this issue 9 months ago · 1 comment

Prerequisite

Type

I'm evaluating with the officially supported tasks/models/datasets.

Environment

{'CUDA available': True, 'GCC': 'gcc (GCC) 7.3.0', 'MMEngine': '0.10.6', 'MUSA available': False, 'OpenCV': '4.11.0', 'PyTorch': '2.1.0', 'PyTorch compiling details': 'PyTorch built with:\n' ' - GCC 10.2\n' ' - C++ Version: 201703\n' ' - Intel(R) MKL-DNN v3.1.1 (Git Hash ' '64f6bcbcbab628e96f33a62c3e975f8535a7bde4)\n' ' - OpenMP 201511 (a.k.a. OpenMP 4.5)\n' ' - LAPACK is enabled (usually provided by ' 'MKL)\n' ' - NNPACK is enabled\n' ' - CPU capability usage: NO AVX\n' ' - Build settings: BLAS_INFO=open, ' 'BUILD_TYPE=Release, ' 'CXX_COMPILER=/opt/rh/devtoolset-10/root/usr/bin/c++, ' 'CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 ' '-fabi-version=11 -fvisibility-inlines-hidden ' '-DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO ' '-DLIBKINETO_NOCUPTI -DLIBKINETO_NOROCTRACER ' '-DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK ' '-DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE ' '-O2 -fPIC -Wall -Wextra -Werror=return-type ' '-Werror=non-virtual-dtor -Werror=bool-operation ' '-Wnarrowing -Wno-missing-field-initializers ' '-Wno-type-limits -Wno-array-bounds ' '-Wno-unknown-pragmas -Wno-unused-parameter ' '-Wno-unused-function -Wno-unused-result ' '-Wno-strict-overflow -Wno-strict-aliasing ' '-Wno-stringop-overflow -Wno-psabi ' '-Wno-error=pedantic -Wno-error=old-style-cast ' '-Wno-invalid-partial-specialization ' '-Wno-unused-private-field ' '-Wno-aligned-allocation-unavailable ' '-Wno-missing-braces -fdiagnostics-color=always ' '-faligned-new -Wno-unused-but-set-variable ' '-Wno-maybe-uninitialized -fno-math-errno ' '-fno-trapping-math -Werror=format ' '-Werror=cast-function-type ' '-Wno-stringop-overflow, LAPACK_INFO=open, ' 'TORCH_DISABLE_GPU_ASSERTS=ON, ' 'TORCH_VERSION=2.1.0, USE_CUDA=OFF, ' 'USE_CUDNN=OFF, USE_EIGEN_FOR_BLAS=ON, ' 'USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, ' 'USE_GLOG=OFF, USE_MKL=OFF, USE_MKLDNN=ON, ' 'USE_MPI=OFF, USE_NCCL=OFF, USE_NNPACK=ON, ' 'USE_OPENMP=ON, USE_ROCM=OFF, \n', 'Python': '3.10.16 (main, Dec 11 2024, 16:18:56) [GCC 11.2.0]', 'TorchVision': '0.16.0', 
'lmdeploy': "not installed:No module named 'lmdeploy'", 'numpy_random_seed': 2147483648, 'opencompass': '0.3.9+', 'sys.platform': 'linux', 'transformers': '4.48.0'}

Reproduces the problem - code/configuration sample

python run.py --models hf_llama3_1_8b --datasets sanitized_mbpp_gen_742f0c --debug

Reproduces the problem - command or script

python run.py --models hf_llama3_1_8b --datasets sanitized_mbpp_gen_742f0c --debug
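For reference, the command above roughly corresponds to an OpenCompass config file like the sketch below. The exact import paths are assumptions and vary across OpenCompass versions; check the `configs/` directory of your installed version.

```python
# Sketch of a config equivalent to:
#   python run.py --models hf_llama3_1_8b --datasets sanitized_mbpp_gen_742f0c
# Module paths below are assumptions and may differ between OpenCompass versions.
from mmengine.config import read_base

with read_base():
    # Dataset and model configs shipped with OpenCompass (paths assumed).
    from opencompass.configs.datasets.mbpp.sanitized_mbpp_gen_742f0c import (
        sanitized_mbpp_datasets,
    )
    from opencompass.configs.models.hf_llama.hf_llama3_1_8b import models

datasets = sanitized_mbpp_datasets
```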

Reproduces the problem - error message

When testing the llama3.1-8b base model with the config file from the official README, I got a score of only 43.58, while the official README reports 54.86 for llama3-8b-turbomind. That is a very large difference. What could explain this gap in scores?

(screenshots of the evaluation results attached)

Other information

No response

GenerallyCovetous · Feb 07 '25 08:02