
[Bug] Prompt with trailing whitespace may hurt model performance

Open yzlnew opened this issue 1 year ago • 7 comments

Prerequisite

Type

I have modified the code (config is not considered code), or I'm working on my own tasks/models/datasets.

Environment

{'CUDA available': True,
 'CUDA_HOME': '/usr/local/cuda',
 'GCC': 'gcc (GCC) 9.2.1 20200522 (Alibaba 9.2.1-3 2.17)',
 'GPU 0,1,2,3': 'NVIDIA A100-SXM4-80GB',
 'MMEngine': '0.10.3',
 'MUSA available': False,
 'NVCC': 'Cuda compilation tools, release 12.1, V12.1.105',
 'OpenCV': '4.9.0',
 'PyTorch': '2.1.0+cu121',
 'PyTorch compiling details': 'PyTorch built with:\n'
                              '  - GCC 9.3\n'
                              '  - C++ Version: 201703\n'
                              '  - Intel(R) oneAPI Math Kernel Library Version '
                              '2022.2-Product Build 20220804 for Intel(R) 64 '
                              'architecture applications\n'
                              '  - Intel(R) MKL-DNN v3.1.1 (Git Hash '
                              '64f6bcbcbab628e96f33a62c3e975f8535a7bde4)\n'
                              '  - OpenMP 201511 (a.k.a. OpenMP 4.5)\n'
                              '  - LAPACK is enabled (usually provided by '
                              'MKL)\n'
                              '  - NNPACK is enabled\n'
                              '  - CPU capability usage: AVX512\n'
                              '  - CUDA Runtime 12.1\n'
                              '  - NVCC architecture flags: '
                              '-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90\n'
                              '  - CuDNN 8.9.3\n'
                              '    - Built with CuDNN 8.9.2\n'
                              '  - Magma 2.6.1\n'
                              '  - Build settings: BLAS_INFO=mkl, '
                              'BUILD_TYPE=Release, CUDA_VERSION=12.1, '
                              'CUDNN_VERSION=8.9.2, '
                              'CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, '
                              'CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 '
                              '-fabi-version=11 -fvisibility-inlines-hidden '
                              '-DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO '
                              '-DLIBKINETO_NOROCTRACER -DUSE_FBGEMM '
                              '-DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK '
                              '-DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE '
                              '-O2 -fPIC -Wall -Wextra -Werror=return-type '
                              '-Werror=non-virtual-dtor -Werror=bool-operation '
                              '-Wnarrowing -Wno-missing-field-initializers '
                              '-Wno-type-limits -Wno-array-bounds '
                              '-Wno-unknown-pragmas -Wno-unused-parameter '
                              '-Wno-unused-function -Wno-unused-result '
                              '-Wno-strict-overflow -Wno-strict-aliasing '
                              '-Wno-stringop-overflow -Wno-psabi '
                              '-Wno-error=pedantic -Wno-error=old-style-cast '
                              '-Wno-invalid-partial-specialization '
                              '-Wno-unused-private-field '
                              '-Wno-aligned-allocation-unavailable '
                              '-Wno-missing-braces -fdiagnostics-color=always '
                              '-faligned-new -Wno-unused-but-set-variable '
                              '-Wno-maybe-uninitialized -fno-math-errno '
                              '-fno-trapping-math -Werror=format '
                              '-Werror=cast-function-type '
                              '-Wno-stringop-overflow, LAPACK_INFO=mkl, '
                              'PERF_WITH_AVX=1, PERF_WITH_AVX2=1, '
                              'PERF_WITH_AVX512=1, '
                              'TORCH_DISABLE_GPU_ASSERTS=ON, '
                              'TORCH_VERSION=2.1.0, USE_CUDA=ON, USE_CUDNN=ON, '
                              'USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, '
                              'USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, '
                              'USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, '
                              'USE_OPENMP=ON, USE_ROCM=OFF, \n',
 'Python': '3.8.18 (default, Sep 11 2023, 13:40:15) [GCC 11.2.0]',
 'TorchVision': '0.16.0+cu121',
 'numpy_random_seed': 2147483648,
 'opencompass': '0.2.1+',
 'sys.platform': 'linux'}

Reproduces the problem - code/configuration sample

Evaluating my own model.

Reproduces the problem - command or script

python run.py --datasets agieval_gen \
--models $MY_MODEL \
--model-kwargs device_map='auto' \
--tokenizer-path $TOKENIZER_PATH \
--tokenizer-kwargs padding_side='left' truncation='left' use_fast=False trust_remote_code=True \
--max-out-len $MAX_OUT_LEN \
--max-seq-len 2048 \
--batch-size 8 \
--no-batch-padding \
--work-dir $WORK_DIR

Reproduces the problem - error message

None

Other information

I'm evaluating on AGIEval and noticed a performance drop under the default config. Digging into the predictions, I found that the model generates unusual tokens, such as multiple whitespace characters or "\n".

https://github.com/open-compass/opencompass/blob/ba7cd58da3317bdec233d097153e2ab92c5f5dd5/configs/datasets/agieval/agieval_gen_64afd3.py#L72

The issue goes away when I remove the trailing whitespace. It looks like an out-of-distribution (OOD) problem: the base model is asked to predict in a situation it never saw during pre-training, which is also mentioned in this video. Going back to the original AGIEval repo, there are no trailing whitespaces in the prompts.
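For illustration, here is a quick way to flag affected templates; this is not OpenCompass code, and `find_trailing_ws` plus the example templates are hypothetical:

```python
def find_trailing_ws(templates):
    """Return the prompt templates that end in a space or tab."""
    return [t for t in templates if t != t.rstrip(" \t")]

templates = [
    "Question: {question}\nAnswer: ",   # trailing space: flagged
    "Question: {question}\nAnswer:",    # clean
]
flagged = find_trailing_ws(templates)
print(flagged)  # only the first template is reported
```

Running something like this over the dataset configs would show which other benchmarks share the problem.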

yzlnew avatar Feb 28 '24 05:02 yzlnew

You are right. LLMs are sensitive to the prompt.

tonysy avatar Feb 28 '24 05:02 tonysy

@tonysy Is this considered a bug in OpenCompass, and will it be fixed in a future release? I've noticed several other datasets with prompts configured similarly, which could cause a possible performance downgrade.

yzlnew avatar Feb 28 '24 06:02 yzlnew

I think it is not a bug; it's an issue with the LLM rather than with the evaluation. Actually, we may need to introduce several different prompts to improve the robustness of the evaluation.

tonysy avatar Feb 28 '24 14:02 tonysy

@tonysy I agree with this view. However, I want to point out that OpenCompass can give different results from the original benchmarks, and the prompts vary across datasets, e.g. with or without trailing whitespace.

Also, this issue can be partially fixed at the tokenization stage. As a result, a model whose tokenizer additionally handles trailing whitespace gets higher scores on the leaderboard, even though that does not reflect the true capability of the model.
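A minimal sketch of such a tokenization-stage mitigation, assuming the common BPE convention that a leading space is encoded as part of the *next* token (the helper name is hypothetical, not an actual OpenCompass or tokenizer API):

```python
def normalize_prompt(prompt: str) -> str:
    """Strip trailing spaces/tabs before tokenization.

    Many BPE vocabularies encode "Answer:" followed by " A" rather than
    "Answer: " followed by "A", so a prompt ending in a bare space puts
    the model in a token boundary rarely seen during pre-training.
    Stripping it restores the familiar boundary; newlines are kept, since
    some templates end with "\n" intentionally.
    """
    return prompt.rstrip(" \t")

print(normalize_prompt("Question: ...\nAnswer: "))  # ends with "Answer:"
```

A tokenizer that applies this kind of normalization internally is what makes the leaderboard comparison uneven: it silently repairs prompts that hurt other models.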

yzlnew avatar Feb 28 '24 14:02 yzlnew

Right, we are working on prompt sensitivity and will provide multi-prompt results soon. Stay tuned.

tonysy avatar Feb 29 '24 08:02 tonysy

@yzlnew It's a problem related to BPE dropout. Our paper discusses this problem: https://arxiv.org/pdf/2404.03608


longxudou avatar Jun 07 '24 16:06 longxudou

@longxudou Thanks. It looks like a simple but effective fix at the tokenization stage.

yzlnew avatar Jun 11 '24 02:06 yzlnew