[Bug] Partition tasks sometimes fail due to occupied ports
Hi, thanks for sharing this great open-source project! When using multiple GPUs for evaluation, I found that partition tasks sometimes fail due to occupied ports.
Prerequisite
- [X] I have searched Issues and Discussions but cannot get the expected help.
- [X] The bug has not been fixed in the latest version.
Type
I'm evaluating with the officially supported tasks/models/datasets.
Environment
python -c "import opencompass.utils;import pprint;pprint.pprint(dict(opencompass.utils.collect_env()))"
{'CUDA available': False,
'GCC': 'gcc (GCC) 5.4.0',
'MMEngine': '0.10.2',
'OpenCV': '4.9.0',
'PyTorch': '2.1.2+cu121',
'PyTorch compiling details': 'PyTorch built with:\n'
' - GCC 9.3\n'
' - C++ Version: 201703\n'
' - Intel(R) oneAPI Math Kernel Library Version '
'2022.2-Product Build 20220804 for Intel(R) 64 '
'architecture applications\n'
' - Intel(R) MKL-DNN v3.1.1 (Git Hash '
'64f6bcbcbab628e96f33a62c3e975f8535a7bde4)\n'
' - OpenMP 201511 (a.k.a. OpenMP 4.5)\n'
' - LAPACK is enabled (usually provided by '
'MKL)\n'
' - NNPACK is enabled\n'
' - CPU capability usage: AVX512\n'
' - Build settings: BLAS_INFO=mkl, '
'BUILD_TYPE=Release, CUDA_VERSION=12.1, '
'CUDNN_VERSION=8.9.2, '
'CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, '
'CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 '
'-fabi-version=11 -fvisibility-inlines-hidden '
'-DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO '
'-DLIBKINETO_NOROCTRACER -DUSE_FBGEMM '
'-DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK '
'-DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE '
'-O2 -fPIC -Wall -Wextra -Werror=return-type '
'-Werror=non-virtual-dtor -Werror=bool-operation '
'-Wnarrowing -Wno-missing-field-initializers '
'-Wno-type-limits -Wno-array-bounds '
'-Wno-unknown-pragmas -Wno-unused-parameter '
'-Wno-unused-function -Wno-unused-result '
'-Wno-strict-overflow -Wno-strict-aliasing '
'-Wno-stringop-overflow -Wno-psabi '
'-Wno-error=pedantic -Wno-error=old-style-cast '
'-Wno-invalid-partial-specialization '
'-Wno-unused-private-field '
'-Wno-aligned-allocation-unavailable '
'-Wno-missing-braces -fdiagnostics-color=always '
'-faligned-new -Wno-unused-but-set-variable '
'-Wno-maybe-uninitialized -fno-math-errno '
'-fno-trapping-math -Werror=format '
'-Werror=cast-function-type '
'-Wno-stringop-overflow, LAPACK_INFO=mkl, '
'PERF_WITH_AVX=1, PERF_WITH_AVX2=1, '
'PERF_WITH_AVX512=1, '
'TORCH_DISABLE_GPU_ASSERTS=ON, '
'TORCH_VERSION=2.1.2, USE_CUDA=ON, USE_CUDNN=ON, '
'USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, '
'USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, '
'USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, '
'USE_OPENMP=ON, USE_ROCM=OFF, \n',
'Python': '3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) '
'[GCC 12.3.0]',
'TorchVision': '0.16.2+cu121',
'numpy_random_seed': 2147483648,
'opencompass': '0.2.1+61fe873',
'sys.platform': 'linux'}
Reproduces the problem - code/configuration sample
from mmengine.config import read_base
with read_base():
    # from .datasets.mmlu.mmlu_ppl_ac766d import mmlu_datasets
    from .datasets.triviaqa.triviaqa_gen_2121ce import triviaqa_datasets
    from .models.hf_llama.hf_llama2_7b import models
    from .summarizers.example import summarizer
datasets = sum([v for k, v in locals().items() if k.endswith("_datasets") or k == 'datasets'], [])
work_dir = './outputs/llama-2-7b-hf'
Reproduces the problem - command or script
python run.py configs/eval_hf_llama2.py --max-partition-size 2000 # fails
python run.py configs/eval_hf_llama2.py --max-partition-size 4000 # fails
python run.py configs/eval_hf_llama2.py --max-partition-size 8000 # succeeds
Reproduces the problem - error message
I ran the above scripts on 8 GPUs, and partition tasks sometimes fail due to occupied ports. For example, with --max-partition-size 4000:
01/19 03:44:14 - OpenCompass - INFO - Partitioned into 45 tasks.
launch OpenICLInfer[llama-2-7b-hf/triviaqa_0] on GPU 0
launch OpenICLInfer[llama-2-7b-hf/triviaqa_1] on GPU 1
launch OpenICLInfer[llama-2-7b-hf/triviaqa_2] on GPU 2
launch OpenICLInfer[llama-2-7b-hf/triviaqa_3] on GPU 3
launch OpenICLInfer[llama-2-7b-hf/triviaqa_4] on GPU 4
launch OpenICLInfer[llama-2-7b-hf/triviaqa_5] on GPU 5
launch OpenICLInfer[llama-2-7b-hf/triviaqa_6] on GPU 6
launch OpenICLInfer[llama-2-7b-hf/triviaqa_7] on GPU 7
01/19 03:44:28 - OpenCompass - WARNING - task OpenICLInfer[llama-2-7b-hf/triviaqa_6] fail, see
./outputs/llama-2-7b-hf/triviaqa/20240119_034414/logs/infer/llama-2-7b-hf/triviaqa_6.out
100%|██████████| 45/45 [27:16<00:00, 36.37s/it]
launch OpenICLInfer[llama-2-7b-hf/triviaqa_32] on GPU 6
launch OpenICLInfer[llama-2-7b-hf/triviaqa_33] on GPU 7
launch OpenICLInfer[llama-2-7b-hf/triviaqa_34] on GPU 5
launch OpenICLInfer[llama-2-7b-hf/triviaqa_35] on GPU 2
launch OpenICLInfer[llama-2-7b-hf/triviaqa_36] on GPU 4
launch OpenICLInfer[llama-2-7b-hf/triviaqa_37] on GPU 3
launch OpenICLInfer[llama-2-7b-hf/triviaqa_38] on GPU 1
launch OpenICLInfer[llama-2-7b-hf/triviaqa_39] on GPU 0
launch OpenICLInfer[llama-2-7b-hf/triviaqa_40] on GPU 6
launch OpenICLInfer[llama-2-7b-hf/triviaqa_41] on GPU 7
launch OpenICLInfer[llama-2-7b-hf/triviaqa_42] on GPU 5
launch OpenICLInfer[llama-2-7b-hf/triviaqa_43] on GPU 3
launch OpenICLInfer[llama-2-7b-hf/triviaqa_44] on GPU 4
launch OpenICLInfer[llama-2-7b-hf/triviaqa_24] on GPU 0
launch OpenICLInfer[llama-2-7b-hf/triviaqa_23] on GPU 1
launch OpenICLInfer[llama-2-7b-hf/triviaqa_11] on GPU 6
launch OpenICLInfer[llama-2-7b-hf/triviaqa_10] on GPU 2
launch OpenICLInfer[llama-2-7b-hf/triviaqa_25] on GPU 7
launch OpenICLInfer[llama-2-7b-hf/triviaqa_22] on GPU 4
launch OpenICLInfer[llama-2-7b-hf/triviaqa_13] on GPU 5
launch OpenICLInfer[llama-2-7b-hf/triviaqa_15] on GPU 3
launch OpenICLInfer[llama-2-7b-hf/triviaqa_12] on GPU 0
launch OpenICLInfer[llama-2-7b-hf/triviaqa_26] on GPU 6
launch OpenICLInfer[llama-2-7b-hf/triviaqa_28] on GPU 1
launch OpenICLInfer[llama-2-7b-hf/triviaqa_8] on GPU 2
launch OpenICLInfer[llama-2-7b-hf/triviaqa_20] on GPU 7
launch OpenICLInfer[llama-2-7b-hf/triviaqa_14] on GPU 4
launch OpenICLInfer[llama-2-7b-hf/triviaqa_31] on GPU 5
launch OpenICLInfer[llama-2-7b-hf/triviaqa_18] on GPU 3
launch OpenICLInfer[llama-2-7b-hf/triviaqa_27] on GPU 1
launch OpenICLInfer[llama-2-7b-hf/triviaqa_19] on GPU 0
launch OpenICLInfer[llama-2-7b-hf/triviaqa_21] on GPU 2
launch OpenICLInfer[llama-2-7b-hf/triviaqa_17] on GPU 6
launch OpenICLInfer[llama-2-7b-hf/triviaqa_9] on GPU 7
launch OpenICLInfer[llama-2-7b-hf/triviaqa_30] on GPU 4
launch OpenICLInfer[llama-2-7b-hf/triviaqa_16] on GPU 5
launch OpenICLInfer[llama-2-7b-hf/triviaqa_29] on GPU 3
01/19 04:11:31 - OpenCompass - ERROR - /home/user/opencompass/opencompass/runners/base.py - summarize - 63 - OpenICLInfer[llama-2-7b-hf/triviaqa_6] failed with code 1
01/19 04:11:31 - OpenCompass - INFO - Partitioned into 1 tasks.
100%|██████████| 1/1 [00:13<00:00, 13.94s/it]
launch OpenICLEval[llama-2-7b-hf/triviaqa] on CPU
dataset version metric mode llama-2-7b-hf
--------- --------- -------- ------ ---------------
The output shows that partition triviaqa_6 failed, and ./outputs/llama-2-7b-hf/triviaqa/20240119_034414/logs/infer/llama-2-7b-hf/triviaqa_6.out contains:
RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:42265 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:42265 (errno: 98 - Address already in use).
which indicates the problem is caused by an occupied port. However, with --max-partition-size 8000, everything works fine:
01/19 03:47:33 - OpenCompass - INFO - Partitioned into 23 tasks.
100%|██████████| 23/23 [24:31<00:00, 63.99s/it]
launch OpenICLInfer[llama-2-7b-hf/triviaqa_0] on GPU 0
launch OpenICLInfer[llama-2-7b-hf/triviaqa_1] on GPU 1
launch OpenICLInfer[llama-2-7b-hf/triviaqa_2] on GPU 2
launch OpenICLInfer[llama-2-7b-hf/triviaqa_3] on GPU 3
launch OpenICLInfer[llama-2-7b-hf/triviaqa_4] on GPU 4
launch OpenICLInfer[llama-2-7b-hf/triviaqa_5] on GPU 5
launch OpenICLInfer[llama-2-7b-hf/triviaqa_6] on GPU 6
launch OpenICLInfer[llama-2-7b-hf/triviaqa_7] on GPU 7
launch OpenICLInfer[llama-2-7b-hf/triviaqa_20] on GPU 4
launch OpenICLInfer[llama-2-7b-hf/triviaqa_18] on GPU 3
launch OpenICLInfer[llama-2-7b-hf/triviaqa_17] on GPU 0
launch OpenICLInfer[llama-2-7b-hf/triviaqa_21] on GPU 1
launch OpenICLInfer[llama-2-7b-hf/triviaqa_19] on GPU 2
launch OpenICLInfer[llama-2-7b-hf/triviaqa_9] on GPU 5
launch OpenICLInfer[llama-2-7b-hf/triviaqa_10] on GPU 6
launch OpenICLInfer[llama-2-7b-hf/triviaqa_11] on GPU 7
launch OpenICLInfer[llama-2-7b-hf/triviaqa_8] on GPU 4
launch OpenICLInfer[llama-2-7b-hf/triviaqa_12] on GPU 3
launch OpenICLInfer[llama-2-7b-hf/triviaqa_22] on GPU 0
launch OpenICLInfer[llama-2-7b-hf/triviaqa_14] on GPU 7
launch OpenICLInfer[llama-2-7b-hf/triviaqa_13] on GPU 1
launch OpenICLInfer[llama-2-7b-hf/triviaqa_16] on GPU 2
launch OpenICLInfer[llama-2-7b-hf/triviaqa_15] on GPU 5
01/19 04:12:05 - OpenCompass - INFO - Partitioned into 1 tasks.
100%|██████████| 1/1 [00:14<00:00, 14.82s/it]
launch OpenICLEval[llama-2-7b-hf/triviaqa] on CPU
dataset version metric mode llama-2-7b-hf
--------- --------- -------- ------ ---------------
triviaqa 2121ce score gen 52.45
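As a sanity check on the errno 98 failure in the --max-partition-size 4000 run above, the standalone snippet below (my own illustration, not OpenCompass code) reproduces the same error by binding two listening sockets to one port, which matches the symptom of a task trying to use a port that another process already holds:

```python
import socket

# Standalone illustration (not OpenCompass code): binding a second listening
# socket to a port that is already in use raises the same error as above,
# "[Errno 98] Address already in use".
s1 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s1.bind(("0.0.0.0", 0))           # let the OS pick a free port
port = s1.getsockname()[1]
s1.listen()

s2 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    s2.bind(("0.0.0.0", port))    # second bind on the same port -> errno 98
except OSError as e:
    print(e)                      # [Errno 98] Address already in use
finally:
    s1.close()
    s2.close()
```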
Other information
I think this is not a server- or dataset-related problem. I have checked that there were no residual processes occupying ports on the server before running the scripts. Besides, I have tried adjusting the range of available port numbers, but the same problem still occurred. Furthermore, I have also tested the mmlu dataset with different values of --max-partition-size, and the same problem occurred from time to time.
Any solution to fix this would be appreciated!
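One direction I am wondering about (just a sketch of the idea, not a patch; I have not checked how the runner currently picks the rendezvous port) is to ask the OS for a currently free port right before each task is launched, e.g.:

```python
import socket

def find_free_port() -> int:
    """Illustrative helper: ask the OS for a TCP port that is free right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))           # port 0 -> the kernel assigns an unused port
        return s.getsockname()[1]

# The returned port could then be passed to the task, e.g. via MASTER_PORT.
# Note: there is still a small race window between probing and the real bind,
# so a retry on errno 98 would probably be needed as well.
print(find_free_port())
```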
BTW, the mmlu accuracy I measured exactly matches the value listed on the website. However, the triviaqa accuracy I measured (52.4) is slightly lower than the reported value (52.8). I'm using the default settings, and I'm wondering if this level of difference is normal? Thanks in advance!
Hi, does the port occupation occur regularly or randomly?