[Bug] Partition tasks sometimes fail due to occupied ports
Hi, thanks for sharing this great open-source project! When using multiple GPUs for evaluation, I found that partition tasks sometimes fail due to occupied ports.
Prerequisite
- [X] I have searched Issues and Discussions but cannot get the expected help.
- [X] The bug has not been fixed in the latest version.
Type
I'm evaluating with the officially supported tasks/models/datasets.
Environment
python -c "import opencompass.utils;import pprint;pprint.pprint(dict(opencompass.utils.collect_env()))"
{'CUDA available': False,
'GCC': 'gcc (GCC) 5.4.0',
'MMEngine': '0.10.2',
'OpenCV': '4.9.0',
'PyTorch': '2.1.2+cu121',
'PyTorch compiling details': 'PyTorch built with:\n'
' - GCC 9.3\n'
' - C++ Version: 201703\n'
' - Intel(R) oneAPI Math Kernel Library Version '
'2022.2-Product Build 20220804 for Intel(R) 64 '
'architecture applications\n'
' - Intel(R) MKL-DNN v3.1.1 (Git Hash '
'64f6bcbcbab628e96f33a62c3e975f8535a7bde4)\n'
' - OpenMP 201511 (a.k.a. OpenMP 4.5)\n'
' - LAPACK is enabled (usually provided by '
'MKL)\n'
' - NNPACK is enabled\n'
' - CPU capability usage: AVX512\n'
' - Build settings: BLAS_INFO=mkl, '
'BUILD_TYPE=Release, CUDA_VERSION=12.1, '
'CUDNN_VERSION=8.9.2, '
'CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, '
'CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 '
'-fabi-version=11 -fvisibility-inlines-hidden '
'-DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO '
'-DLIBKINETO_NOROCTRACER -DUSE_FBGEMM '
'-DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK '
'-DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE '
'-O2 -fPIC -Wall -Wextra -Werror=return-type '
'-Werror=non-virtual-dtor -Werror=bool-operation '
'-Wnarrowing -Wno-missing-field-initializers '
'-Wno-type-limits -Wno-array-bounds '
'-Wno-unknown-pragmas -Wno-unused-parameter '
'-Wno-unused-function -Wno-unused-result '
'-Wno-strict-overflow -Wno-strict-aliasing '
'-Wno-stringop-overflow -Wno-psabi '
'-Wno-error=pedantic -Wno-error=old-style-cast '
'-Wno-invalid-partial-specialization '
'-Wno-unused-private-field '
'-Wno-aligned-allocation-unavailable '
'-Wno-missing-braces -fdiagnostics-color=always '
'-faligned-new -Wno-unused-but-set-variable '
'-Wno-maybe-uninitialized -fno-math-errno '
'-fno-trapping-math -Werror=format '
'-Werror=cast-function-type '
'-Wno-stringop-overflow, LAPACK_INFO=mkl, '
'PERF_WITH_AVX=1, PERF_WITH_AVX2=1, '
'PERF_WITH_AVX512=1, '
'TORCH_DISABLE_GPU_ASSERTS=ON, '
'TORCH_VERSION=2.1.2, USE_CUDA=ON, USE_CUDNN=ON, '
'USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, '
'USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, '
'USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, '
'USE_OPENMP=ON, USE_ROCM=OFF, \n',
'Python': '3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) '
'[GCC 12.3.0]',
'TorchVision': '0.16.2+cu121',
'numpy_random_seed': 2147483648,
'opencompass': '0.2.1+61fe873',
'sys.platform': 'linux'}
Reproduces the problem - code/configuration sample
from mmengine.config import read_base
with read_base():
    # from .datasets.mmlu.mmlu_ppl_ac766d import mmlu_datasets
    from .datasets.triviaqa.triviaqa_gen_2121ce import triviaqa_datasets
    from .models.hf_llama.hf_llama2_7b import models
    from .summarizers.example import summarizer
datasets = sum([v for k, v in locals().items() if k.endswith("_datasets") or k == 'datasets'], [])
work_dir = './outputs/llama-2-7b-hf'
Reproduces the problem - command or script
python run.py configs/eval_hf_llama2.py --max-partition-size 2000 # fails
python run.py configs/eval_hf_llama2.py --max-partition-size 4000 # fails
python run.py configs/eval_hf_llama2.py --max-partition-size 8000 # succeeds
Reproduces the problem - error message
I ran the above scripts on 8 GPUs, and partition tasks sometimes fail due to occupied ports. For example, with --max-partition-size 4000:
01/19 03:44:14 - OpenCompass - INFO - Partitioned into 45 tasks.
launch OpenICLInfer[llama-2-7b-hf/triviaqa_0] on GPU 0
launch OpenICLInfer[llama-2-7b-hf/triviaqa_1] on GPU 1
launch OpenICLInfer[llama-2-7b-hf/triviaqa_2] on GPU 2
launch OpenICLInfer[llama-2-7b-hf/triviaqa_3] on GPU 3
launch OpenICLInfer[llama-2-7b-hf/triviaqa_4] on GPU 4
launch OpenICLInfer[llama-2-7b-hf/triviaqa_5] on GPU 5
launch OpenICLInfer[llama-2-7b-hf/triviaqa_6] on GPU 6
launch OpenICLInfer[llama-2-7b-hf/triviaqa_7] on GPU 7
01/19 03:44:28 - OpenCompass - WARNING - task OpenICLInfer[llama-2-7b-hf/triviaqa_6] fail, see
./outputs/llama-2-7b-hf/triviaqa/20240119_034414/logs/infer/llama-2-7b-hf/triviaqa_6.out
100%|██████████| 45/45 [27:16<00:00, 36.37s/it]
launch OpenICLInfer[llama-2-7b-hf/triviaqa_32] on GPU 6
launch OpenICLInfer[llama-2-7b-hf/triviaqa_33] on GPU 7
launch OpenICLInfer[llama-2-7b-hf/triviaqa_34] on GPU 5
launch OpenICLInfer[llama-2-7b-hf/triviaqa_35] on GPU 2
launch OpenICLInfer[llama-2-7b-hf/triviaqa_36] on GPU 4
launch OpenICLInfer[llama-2-7b-hf/triviaqa_37] on GPU 3
launch OpenICLInfer[llama-2-7b-hf/triviaqa_38] on GPU 1
launch OpenICLInfer[llama-2-7b-hf/triviaqa_39] on GPU 0
launch OpenICLInfer[llama-2-7b-hf/triviaqa_40] on GPU 6
launch OpenICLInfer[llama-2-7b-hf/triviaqa_41] on GPU 7
launch OpenICLInfer[llama-2-7b-hf/triviaqa_42] on GPU 5
launch OpenICLInfer[llama-2-7b-hf/triviaqa_43] on GPU 3
launch OpenICLInfer[llama-2-7b-hf/triviaqa_44] on GPU 4
launch OpenICLInfer[llama-2-7b-hf/triviaqa_24] on GPU 0
launch OpenICLInfer[llama-2-7b-hf/triviaqa_23] on GPU 1
launch OpenICLInfer[llama-2-7b-hf/triviaqa_11] on GPU 6
launch OpenICLInfer[llama-2-7b-hf/triviaqa_10] on GPU 2
launch OpenICLInfer[llama-2-7b-hf/triviaqa_25] on GPU 7
launch OpenICLInfer[llama-2-7b-hf/triviaqa_22] on GPU 4
launch OpenICLInfer[llama-2-7b-hf/triviaqa_13] on GPU 5
launch OpenICLInfer[llama-2-7b-hf/triviaqa_15] on GPU 3
launch OpenICLInfer[llama-2-7b-hf/triviaqa_12] on GPU 0
launch OpenICLInfer[llama-2-7b-hf/triviaqa_26] on GPU 6
launch OpenICLInfer[llama-2-7b-hf/triviaqa_28] on GPU 1
launch OpenICLInfer[llama-2-7b-hf/triviaqa_8] on GPU 2
launch OpenICLInfer[llama-2-7b-hf/triviaqa_20] on GPU 7
launch OpenICLInfer[llama-2-7b-hf/triviaqa_14] on GPU 4
launch OpenICLInfer[llama-2-7b-hf/triviaqa_31] on GPU 5
launch OpenICLInfer[llama-2-7b-hf/triviaqa_18] on GPU 3
launch OpenICLInfer[llama-2-7b-hf/triviaqa_27] on GPU 1
launch OpenICLInfer[llama-2-7b-hf/triviaqa_19] on GPU 0
launch OpenICLInfer[llama-2-7b-hf/triviaqa_21] on GPU 2
launch OpenICLInfer[llama-2-7b-hf/triviaqa_17] on GPU 6
launch OpenICLInfer[llama-2-7b-hf/triviaqa_9] on GPU 7
launch OpenICLInfer[llama-2-7b-hf/triviaqa_30] on GPU 4
launch OpenICLInfer[llama-2-7b-hf/triviaqa_16] on GPU 5
launch OpenICLInfer[llama-2-7b-hf/triviaqa_29] on GPU 3
01/19 04:11:31 - OpenCompass - ERROR - /home/user/opencompass/opencompass/runners/base.py - summarize - 63 - OpenICLInfer[llama-2-7b-hf/triviaqa_6] failed with code 1
01/19 04:11:31 - OpenCompass - INFO - Partitioned into 1 tasks.
100%|██████████| 1/1 [00:13<00:00, 13.94s/it]
launch OpenICLEval[llama-2-7b-hf/triviaqa] on CPU
dataset version metric mode llama-2-7b-hf
--------- --------- -------- ------ ---------------
The output shows that partition triviaqa_6 failed, and ./outputs/llama-2-7b-hf/triviaqa/20240119_034414/logs/infer/llama-2-7b-hf/triviaqa_6.out contains:
RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:42265 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:42265 (errno: 98 - Address already in use).
which indicates the problem is caused by an occupied port. However, with --max-partition-size 8000, everything works fine:
01/19 03:47:33 - OpenCompass - INFO - Partitioned into 23 tasks.
100%|██████████| 23/23 [24:31<00:00, 63.99s/it]
launch OpenICLInfer[llama-2-7b-hf/triviaqa_0] on GPU 0
launch OpenICLInfer[llama-2-7b-hf/triviaqa_1] on GPU 1
launch OpenICLInfer[llama-2-7b-hf/triviaqa_2] on GPU 2
launch OpenICLInfer[llama-2-7b-hf/triviaqa_3] on GPU 3
launch OpenICLInfer[llama-2-7b-hf/triviaqa_4] on GPU 4
launch OpenICLInfer[llama-2-7b-hf/triviaqa_5] on GPU 5
launch OpenICLInfer[llama-2-7b-hf/triviaqa_6] on GPU 6
launch OpenICLInfer[llama-2-7b-hf/triviaqa_7] on GPU 7
launch OpenICLInfer[llama-2-7b-hf/triviaqa_20] on GPU 4
launch OpenICLInfer[llama-2-7b-hf/triviaqa_18] on GPU 3
launch OpenICLInfer[llama-2-7b-hf/triviaqa_17] on GPU 0
launch OpenICLInfer[llama-2-7b-hf/triviaqa_21] on GPU 1
launch OpenICLInfer[llama-2-7b-hf/triviaqa_19] on GPU 2
launch OpenICLInfer[llama-2-7b-hf/triviaqa_9] on GPU 5
launch OpenICLInfer[llama-2-7b-hf/triviaqa_10] on GPU 6
launch OpenICLInfer[llama-2-7b-hf/triviaqa_11] on GPU 7
launch OpenICLInfer[llama-2-7b-hf/triviaqa_8] on GPU 4
launch OpenICLInfer[llama-2-7b-hf/triviaqa_12] on GPU 3
launch OpenICLInfer[llama-2-7b-hf/triviaqa_22] on GPU 0
launch OpenICLInfer[llama-2-7b-hf/triviaqa_14] on GPU 7
launch OpenICLInfer[llama-2-7b-hf/triviaqa_13] on GPU 1
launch OpenICLInfer[llama-2-7b-hf/triviaqa_16] on GPU 2
launch OpenICLInfer[llama-2-7b-hf/triviaqa_15] on GPU 5
01/19 04:12:05 - OpenCompass - INFO - Partitioned into 1 tasks.
100%|██████████| 1/1 [00:14<00:00, 14.82s/it]
launch OpenICLEval[llama-2-7b-hf/triviaqa] on CPU
dataset version metric mode llama-2-7b-hf
--------- --------- -------- ------ ---------------
triviaqa 2121ce score gen 52.45
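As a sanity check on the errno 98 failure in the --max-partition-size 4000 run above, the standalone snippet below (my own illustration, not OpenCompass code) reproduces the same error by binding two listening sockets to one port, which matches the symptom of a task trying to use a port that another process already holds:

```python
import socket

# Standalone illustration (not OpenCompass code): binding a second listening
# socket to a port that is already in use raises the same error as above,
# "[Errno 98] Address already in use".
s1 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s1.bind(("0.0.0.0", 0))           # let the OS pick a free port
port = s1.getsockname()[1]
s1.listen()

s2 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    s2.bind(("0.0.0.0", port))    # second bind on the same port -> errno 98
except OSError as e:
    print(e)                      # [Errno 98] Address already in use
finally:
    s1.close()
    s2.close()
```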
Other information
I think this is not a server- or dataset-related problem. I have checked that there were no residual processes occupying ports on the server before running the scripts. Besides, I have tried adjusting the range of available port numbers, but the same problem still occurred. Furthermore, I have also tested the mmlu dataset with different values of --max-partition-size, and the same problem occurred from time to time.
Any solution to fix this would be appreciated!
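One direction I am wondering about (just a sketch of the idea, not a patch; I have not checked how the runner currently picks the rendezvous port) is to ask the OS for a currently free port right before each task is launched, e.g.:

```python
import socket

def find_free_port() -> int:
    """Illustrative helper: ask the OS for a TCP port that is free right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))           # port 0 -> the kernel assigns an unused port
        return s.getsockname()[1]

# The returned port could then be passed to the task, e.g. via MASTER_PORT.
# Note: there is still a small race window between probing and the real bind,
# so a retry on errno 98 would probably be needed as well.
print(find_free_port())
```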
BTW, the mmlu accuracy I measured exactly matches the value listed on the website. However, the triviaqa accuracy I measured (52.4) is slightly lower than the reported value (52.8). I'm using the default settings, and I'm wondering if this level of difference is normal? Thanks in advance!
Hi, does the port occupation occur regularly or randomly?