
[Bug] math_gen dataset evaluation fails randomly

Open berton820 opened this issue 9 months ago • 4 comments

Prerequisites

  • [X] I have searched the Issues and Discussions but did not get the expected help.
  • [X] The bug has not been fixed in the latest version.

Type of issue

I am evaluating with officially supported tasks/models/datasets.

Environment

{'CUDA available': True, 'CUDA_HOME': '/usr/local/cuda', 'GCC': 'gcc (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0', 'GPU 0,1,2,3,4,5,6,7': 'NVIDIA A800-SXM4-80GB', 'MMEngine': '0.10.4', 'MUSA available': False, 'NVCC': 'Cuda compilation tools, release 11.7, V11.7.64', 'OpenCV': '4.9.0', 'PyTorch': '2.3.0+cu121', 'PyTorch compiling details': 'PyTorch built with:\n' ' - GCC 9.3\n' ' - C++ Version: 201703\n' ' - Intel(R) oneAPI Math Kernel Library Version ' '2022.2-Product Build 20220804 for Intel(R) 64 ' 'architecture applications\n' ' - Intel(R) MKL-DNN v3.3.6 (Git Hash ' '86e6af5974177e513fd3fee58425e1063e7f1361)\n' ' - OpenMP 201511 (a.k.a. OpenMP 4.5)\n' ' - LAPACK is enabled (usually provided by ' 'MKL)\n' ' - NNPACK is enabled\n' ' - CPU capability usage: AVX512\n' ' - CUDA Runtime 12.1\n' ' - NVCC architecture flags: ' '-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90\n' ' - CuDNN 8.9.2\n' ' - Magma 2.6.1\n' ' - Build settings: BLAS_INFO=mkl, ' 'BUILD_TYPE=Release, CUDA_VERSION=12.1, ' 'CUDNN_VERSION=8.9.2, ' 'CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, ' 'CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 ' '-fabi-version=11 -fvisibility-inlines-hidden ' '-DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO ' '-DLIBKINETO_NOROCTRACER -DUSE_FBGEMM ' '-DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK ' '-DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE ' '-O2 -fPIC -Wall -Wextra -Werror=return-type ' '-Werror=non-virtual-dtor -Werror=bool-operation ' '-Wnarrowing -Wno-missing-field-initializers ' '-Wno-type-limits -Wno-array-bounds ' '-Wno-unknown-pragmas -Wno-unused-parameter ' '-Wno-unused-function -Wno-unused-result ' '-Wno-strict-overflow -Wno-strict-aliasing ' '-Wno-stringop-overflow -Wsuggest-override ' '-Wno-psabi -Wno-error=pedantic ' '-Wno-error=old-style-cast -Wno-missing-braces ' '-fdiagnostics-color=always -faligned-new ' '-Wno-unused-but-set-variable ' '-Wno-maybe-uninitialized -fno-math-errno ' '-fno-trapping-math -Werror=format ' '-Wno-stringop-overflow, LAPACK_INFO=mkl, ' 'PERF_WITH_AVX=1, PERF_WITH_AVX2=1, ' 'PERF_WITH_AVX512=1, TORCH_VERSION=2.3.0, ' 'USE_CUDA=ON, USE_CUDNN=ON, USE_CUSPARSELT=1, ' 'USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, ' 'USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, ' 'USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, ' 'USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, ' 'USE_ROCM_KERNEL_ASSERT=OFF, \n', 'Python': '3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0]', 'TorchVision': '0.18.0+cu121', 'numpy_random_seed': 2147483648, 'opencompass': '0.2.4+', 'sys.platform': 'linux'}

Reproduces the problem - code/configuration sample

Running the official Qwen1.5-1.8B model on the math_gen dataset.

Reproduces the problem - command or script

CUDA_VISIBLE_DEVICES=0 python run.py \
--datasets math_gen \
--hf-path local_Qwen1.5 \
--tokenizer-path local_Qwen1.5 \
--work-dir ./outputs/ \
--model-kwargs device_map='auto' \
--tokenizer-kwargs padding_side='left' truncation='left' use_fast=False \
--max-out-len 100 \
--max-seq-len 2048 \
--batch-size 8 \
--no-batch-padding \
--num-gpus 1

Reproduces the problem - error message

opencompass/opencompass/runners/base.py - summarize - 64 - OpenICLInfer[opencompass.models.huggingface.HuggingFace_download_Qwen1.5-1.8B/math_24] failed with code 1

opencompass/opencompass/tasks/openicl_eval.py - _score - 239 - Task [opencompass.models.huggingface.HuggingFace_download_Qwen1.5-1.8B/math]: preds and refrs have different length
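For reference, the second error is a consistency check in the eval task: the number of predictions produced by the infer stage must match the number of references in the dataset. Below is a minimal, illustrative sketch of what that check amounts to; it is not the actual openicl_eval.py code, and the function and variable names are assumptions:

import json

# Illustrative sketch: the eval task loads the predictions written by the
# infer stage and compares their count with the dataset references. If the
# infer task for a partition died early, the prediction file is short and
# this check fails with "preds and refrs have different length".
def check_lengths(pred_file, references):
    with open(pred_file) as f:
        preds = json.load(f)  # assumed: one entry per evaluated sample
    if len(preds) != len(references):
        raise ValueError(
            f'preds and refrs have different length: '
            f'{len(preds)} vs {len(references)}')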

(error screenshots attached)

Other information

I am evaluating the unmodified official Qwen1.5 model with OpenCompass. The math dataset is split into several partitions, and a different partition fails on every run: this time math_24 failed, previously it was math_6, and so on.
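To see which partition is incomplete, one option is to count the entries in each partition's prediction file; the partition whose infer task was cut short will have fewer entries than the others. This is only a rough sketch: the outputs/<timestamp>/predictions/<model>/math_*.json layout and JSON structure are assumptions based on a default run, so adjust the glob to the actual work dir.

import glob
import json

# Rough sketch: count predictions per math_* partition file so the
# partition that was cut short stands out. Paths assume the default
# work-dir layout; adjust to the actual timestamp and model directory.
for path in sorted(glob.glob('outputs/*/predictions/*/math_*.json')):
    with open(path) as f:
        preds = json.load(f)
    print(f'{path}: {len(preds)} predictions')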

berton820 avatar May 14 '24 09:05 berton820

Have you changed your partition logic midway? If you run it all at once, this problem shouldn't occur

liushz avatar May 14 '24 09:05 liushz

Have you changed your partition logic midway? If you run it all at once, this problem shouldn't occur

  1. I have not made any changes.
  2. To avoid unexpected bugs, I also ran rm ~/.cache before running the script.

berton820 avatar May 14 '24 09:05 berton820

The error log in your eval stage appears because there were errors during your infer stage, so the length of the predictions differs from the refs; you can check the following log (screenshot attached):
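If the screenshot is hard to read, the per-task infer logs on disk usually contain the underlying exception. Below is a quick way to scan them; it is only a sketch, and the outputs/<timestamp>/logs/infer/<model>/<dataset>.out layout is an assumption based on a default work dir, so adjust the glob to the actual run.

import glob

# Sketch: scan the infer-stage logs of the math partitions for anything
# that looks like an error or a signal, and print the matching lines.
for path in sorted(glob.glob('outputs/*/logs/infer/*/math_*.out')):
    with open(path, errors='ignore') as f:
        for line in f:
            if 'Error' in line or 'Traceback' in line or 'signal' in line:
                print(f'{path}: {line.rstrip()}')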

liushz avatar May 14 '24 09:05 liushz

The error log in your eval stage appears because there were errors during your infer stage, so the length of the predictions differs from the refs; you can check the following log (screenshot attached):

Hi liushz, the log is here, but I cannot get the point (screenshot attached):


W0513 14:40:06.153000 139835420161856 torch/distributed/elastic/agent/server/api.py:741] Received Signals.SIGHUP death signal, shutting down workers
W0513 14:40:06.154000 139835420161856 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2678994 closing signal SIGHUP
Traceback (most recent call last):
  File "/home/jovyan/anaconda3/envs/opencompass/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/jovyan/anaconda3/envs/opencompass/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/home/jovyan/anaconda3/envs/opencompass/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/home/jovyan/anaconda3/envs/opencompass/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/jovyan/anaconda3/envs/opencompass/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/jovyan/anaconda3/envs/opencompass/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 254, in launch_agent
    result = agent.run()
  File "/home/jovyan/anaconda3/envs/opencompass/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
    result = f(*args, **kwargs)
  File "/home/jovyan/anaconda3/envs/opencompass/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 733, in run
    result = self._invoke_run(role)
  File "/home/jovyan/anaconda3/envs/opencompass/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 876, in _invoke_run
    time.sleep(monitor_interval)
  File "/home/jovyan/anaconda3/envs/opencompass/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 76, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 2678928 got signal: 1
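Signal 1 in this traceback is SIGHUP, i.e. the controlling terminal or parent session went away and torchrun shut its workers down, so the infer task appears to have been killed externally rather than crashing on its own, which would also explain why a different partition fails on each run. Below is a minimal sketch of one way to keep the run alive after the terminal closes, by detaching it from the session; the log file name and this launch approach are assumptions, not an official OpenCompass recommendation.

import signal
import subprocess

# Signal 1 reported by the traceback is SIGHUP.
assert signal.SIGHUP.value == 1

# Detach the evaluation from the current session (similar in spirit to
# nohup/setsid) so a closed SSH session does not deliver SIGHUP to it.
cmd = ['python', 'run.py', '--datasets', 'math_gen',
       '--hf-path', 'local_Qwen1.5', '--work-dir', './outputs/']
with open('opencompass_run.log', 'w') as log:
    subprocess.Popen(cmd, stdout=log, stderr=subprocess.STDOUT,
                     start_new_session=True)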

berton820 avatar May 14 '24 10:05 berton820