
[Bug] Prompt with trailing whitespace may hurt model performance

Open yzlnew opened this issue 1 year ago • 7 comments

Prerequisite

Type

I have modified the code (config is not considered code), or I'm working on my own tasks/models/datasets.

Environment

{'CUDA available': True,
 'CUDA_HOME': '/usr/local/cuda',
 'GCC': 'gcc (GCC) 9.2.1 20200522 (Alibaba 9.2.1-3 2.17)',
 'GPU 0,1,2,3': 'NVIDIA A100-SXM4-80GB',
 'MMEngine': '0.10.3',
 'MUSA available': False,
 'NVCC': 'Cuda compilation tools, release 12.1, V12.1.105',
 'OpenCV': '4.9.0',
 'PyTorch': '2.1.0+cu121',
 'PyTorch compiling details': 'PyTorch built with:\n'
                              '  - GCC 9.3\n'
                              '  - C++ Version: 201703\n'
                              '  - Intel(R) oneAPI Math Kernel Library Version '
                              '2022.2-Product Build 20220804 for Intel(R) 64 '
                              'architecture applications\n'
                              '  - Intel(R) MKL-DNN v3.1.1 (Git Hash '
                              '64f6bcbcbab628e96f33a62c3e975f8535a7bde4)\n'
                              '  - OpenMP 201511 (a.k.a. OpenMP 4.5)\n'
                              '  - LAPACK is enabled (usually provided by '
                              'MKL)\n'
                              '  - NNPACK is enabled\n'
                              '  - CPU capability usage: AVX512\n'
                              '  - CUDA Runtime 12.1\n'
                              '  - NVCC architecture flags: '
                              '-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90\n'
                              '  - CuDNN 8.9.3\n'
                              '    - Built with CuDNN 8.9.2\n'
                              '  - Magma 2.6.1\n'
                              '  - Build settings: BLAS_INFO=mkl, '
                              'BUILD_TYPE=Release, CUDA_VERSION=12.1, '
                              'CUDNN_VERSION=8.9.2, '
                              'CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, '
                              'CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 '
                              '-fabi-version=11 -fvisibility-inlines-hidden '
                              '-DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO '
                              '-DLIBKINETO_NOROCTRACER -DUSE_FBGEMM '
                              '-DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK '
                              '-DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE '
                              '-O2 -fPIC -Wall -Wextra -Werror=return-type '
                              '-Werror=non-virtual-dtor -Werror=bool-operation '
                              '-Wnarrowing -Wno-missing-field-initializers '
                              '-Wno-type-limits -Wno-array-bounds '
                              '-Wno-unknown-pragmas -Wno-unused-parameter '
                              '-Wno-unused-function -Wno-unused-result '
                              '-Wno-strict-overflow -Wno-strict-aliasing '
                              '-Wno-stringop-overflow -Wno-psabi '
                              '-Wno-error=pedantic -Wno-error=old-style-cast '
                              '-Wno-invalid-partial-specialization '
                              '-Wno-unused-private-field '
                              '-Wno-aligned-allocation-unavailable '
                              '-Wno-missing-braces -fdiagnostics-color=always '
                              '-faligned-new -Wno-unused-but-set-variable '
                              '-Wno-maybe-uninitialized -fno-math-errno '
                              '-fno-trapping-math -Werror=format '
                              '-Werror=cast-function-type '
                              '-Wno-stringop-overflow, LAPACK_INFO=mkl, '
                              'PERF_WITH_AVX=1, PERF_WITH_AVX2=1, '
                              'PERF_WITH_AVX512=1, '
                              'TORCH_DISABLE_GPU_ASSERTS=ON, '
                              'TORCH_VERSION=2.1.0, USE_CUDA=ON, USE_CUDNN=ON, '
                              'USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, '
                              'USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, '
                              'USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, '
                              'USE_OPENMP=ON, USE_ROCM=OFF, \n',
 'Python': '3.8.18 (default, Sep 11 2023, 13:40:15) [GCC 11.2.0]',
 'TorchVision': '0.16.0+cu121',
 'numpy_random_seed': 2147483648,
 'opencompass': '0.2.1+',
 'sys.platform': 'linux'}

Reproduces the problem - code/configuration sample

Evaluating my own model.

Reproduces the problem - command or script

python run.py --datasets agieval_gen \
--models $MY_MODEL \
--model-kwargs device_map='auto' \
--tokenizer-path $TOKENIZER_PATH \
--tokenizer-kwargs padding_side='left' truncation='left' use_fast=False trust_remote_code=True \
--max-out-len $MAX_OUT_LEN \
--max-seq-len 2048 \
--batch-size 8 \
--no-batch-padding \
--work-dir $WORK_DIR

Reproduces the problem - error message

None

Other information

I'm evaluating on AGIEval and noticed a performance drop under the default config. Digging into the predictions, I found that the model generates unusual tokens, such as multiple whitespace characters or "\n".

https://github.com/open-compass/opencompass/blob/ba7cd58da3317bdec233d097153e2ab92c5f5dd5/configs/datasets/agieval/agieval_gen_64afd3.py#L72

The issue goes away when I remove the trailing whitespace. It looks like an out-of-distribution (OOD) problem: the base model is asked to predict in a situation it never saw during pre-training, which is also mentioned in this video. Going back to the original AGIEval repo, there are no trailing whitespaces in the prompts.
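For illustration, here is a quick way to flag affected templates; this is not OpenCompass code, and `find_trailing_ws` plus the example templates are hypothetical:

```python
def find_trailing_ws(templates):
    """Return the prompt templates that end in a space or tab."""
    return [t for t in templates if t != t.rstrip(" \t")]

templates = [
    "Question: {question}\nAnswer: ",   # trailing space: flagged
    "Question: {question}\nAnswer:",    # clean
]
flagged = find_trailing_ws(templates)
print(flagged)  # only the first template is reported
```

Running something like this over the dataset configs would show which other benchmarks share the problem.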

yzlnew avatar Feb 28 '24 05:02 yzlnew

You are right. LLMs are sensitive to the prompt.

tonysy avatar Feb 28 '24 05:02 tonysy

@tonysy Is this considered a bug in OpenCompass, and will it be fixed in a future release? I've noticed several other datasets with prompts configured similarly, which could cause a possible performance downgrade.

yzlnew avatar Feb 28 '24 06:02 yzlnew

I think it is not a bug; it's an issue with the LLM rather than with the evaluation. Actually, we may need to introduce several different prompts to improve the robustness of the evaluation.

tonysy avatar Feb 28 '24 14:02 tonysy

@tonysy I agree with this view. However, I want to point out that OpenCompass can give different results from the original benchmarks, and the prompts vary across datasets, e.g. with or without trailing whitespace.

Also, this issue can be partially fixed at the tokenization stage. As a result, a model whose tokenizer additionally handles trailing whitespace gets higher scores on the leaderboard, even though that does not reflect the true capability of the model.
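A minimal sketch of such a tokenization-stage mitigation, assuming the common BPE convention that a leading space is encoded as part of the *next* token (the helper name is hypothetical, not an actual OpenCompass or tokenizer API):

```python
def normalize_prompt(prompt: str) -> str:
    """Strip trailing spaces/tabs before tokenization.

    Many BPE vocabularies encode "Answer:" followed by " A" rather than
    "Answer: " followed by "A", so a prompt ending in a bare space puts
    the model in a token boundary rarely seen during pre-training.
    Stripping it restores the familiar boundary; newlines are kept, since
    some templates end with "\n" intentionally.
    """
    return prompt.rstrip(" \t")

print(normalize_prompt("Question: ...\nAnswer: "))  # ends with "Answer:"
```

A tokenizer that applies this kind of normalization internally is what makes the leaderboard comparison uneven: it silently repairs prompts that hurt other models.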

yzlnew avatar Feb 28 '24 14:02 yzlnew

Right, we are working on prompt sensitivity and will provide multi-prompt results soon. Stay tuned.

tonysy avatar Feb 29 '24 08:02 tonysy

@yzlnew It's a problem related to BPE dropout. Our paper discusses this problem: https://arxiv.org/pdf/2404.03608


longxudou avatar Jun 07 '24 16:06 longxudou

@longxudou Thanks. It looks like a simple but effective fix at the tokenization stage.

yzlnew avatar Jun 11 '24 02:06 yzlnew