[Bug] The program frequently stops running during execution
Prerequisite
- [X] I have searched Issues and Discussions but cannot get the expected help.
- [X] The bug has not been fixed in the latest version.
Type
I'm evaluating with the officially supported tasks/models/datasets.
Environment
The environment information below was not collected on the actual running machines; however, the machine images and Python environments are identical, the only difference being the number of GPUs.
{'CUDA available': True, 'CUDA_HOME': '/usr/local/cuda', 'GCC': 'gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0', 'GPU 0,1': 'NVIDIA A100-SXM4-80GB', 'MMEngine': '0.9.1', 'NVCC': 'Cuda compilation tools, release 11.8, V11.8.89', 'OpenCV': '4.8.1', 'PyTorch': '2.1.0', 'PyTorch compiling details': 'PyTorch built with:\n' ' - GCC 9.3\n' ' - C++ Version: 201703\n' ' - Intel(R) oneAPI Math Kernel Library Version ' '2023.1-Product Build 20230303 for Intel(R) 64 ' 'architecture applications\n' ' - Intel(R) MKL-DNN v3.1.1 (Git Hash ' '64f6bcbcbab628e96f33a62c3e975f8535a7bde4)\n' ' - OpenMP 201511 (a.k.a. OpenMP 4.5)\n' ' - LAPACK is enabled (usually provided by ' 'MKL)\n' ' - NNPACK is enabled\n' ' - CPU capability usage: AVX512\n' ' - CUDA Runtime 11.8\n' ' - NVCC architecture flags: ' '-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_90,code=sm_90;-gencode;arch=compute_37,code=compute_37\n' ' - CuDNN 8.7\n' ' - Magma 2.6.1\n' ' - Build settings: BLAS_INFO=mkl, ' 'BUILD_TYPE=Release, CUDA_VERSION=11.8, ' 'CUDNN_VERSION=8.7.0, ' 'CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, ' 'CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 ' '-fabi-version=11 -fvisibility-inlines-hidden ' '-DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO ' '-DLIBKINETO_NOROCTRACER -DUSE_FBGEMM ' '-DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK ' '-DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE ' '-O2 -fPIC -Wall -Wextra -Werror=return-type ' '-Werror=non-virtual-dtor -Werror=bool-operation ' '-Wnarrowing -Wno-missing-field-initializers ' '-Wno-type-limits -Wno-array-bounds ' '-Wno-unknown-pragmas -Wno-unused-parameter ' '-Wno-unused-function -Wno-unused-result ' '-Wno-strict-overflow -Wno-strict-aliasing ' '-Wno-stringop-overflow -Wno-psabi ' '-Wno-error=pedantic -Wno-error=old-style-cast ' '-Wno-invalid-partial-specialization ' '-Wno-unused-private-field ' '-Wno-aligned-allocation-unavailable ' '-Wno-missing-braces -fdiagnostics-color=always ' '-faligned-new -Wno-unused-but-set-variable ' '-Wno-maybe-uninitialized -fno-math-errno ' '-fno-trapping-math -Werror=format ' '-Werror=cast-function-type ' '-Wno-stringop-overflow, LAPACK_INFO=mkl, ' 'PERF_WITH_AVX=1, PERF_WITH_AVX2=1, ' 'PERF_WITH_AVX512=1, ' 'TORCH_DISABLE_GPU_ASSERTS=ON, ' 'TORCH_VERSION=2.1.0, USE_CUDA=ON, USE_CUDNN=ON, ' 'USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, ' 'USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, ' 'USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, ' 'USE_OPENMP=ON, USE_ROCM=OFF, \n', 'Python': '3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0]', 'TorchVision': '0.16.0', 'numpy_random_seed': 2147483648, 'opencompass': '0.1.8+19ad7f9', 'sys.platform': 'linux'}
Reproduces the problem - code/configuration sample
This happens during most runs, but since no logs are produced I'm not sure what is going on. All I know is the observed behavior: even though unfinished tasks remain, the GPUs no longer appear to be utilized. There may be an issue with task scheduling.
Reproduces the problem - command or script
python run.py --models llama2_13b_hf --summarizer leaderboard -r -l -w ./outputs/main --datasets \
SuperGLUE_COPA_ppl piqa_ppl siqa_ppl strategyqa_gen bbh_gen mmlu_ppl SuperGLUE_MultiRC_ppl race_ppl drop_gen \
obqa_ppl squad20_gen Xsum_gen lambada_gen leval longbench civilcomments_clp tydiqa_gen XLSum_gen \
SuperGLUE_BoolQ_ppl commonsenseqa_ppl SuperGLUE_AX_b_ppl SuperGLUE_AX_g_ppl storycloze_ppl triviaqa_gen
Reproduces the problem - error message
There is no error message; the GPUs simply stop being utilized, and this persists for a considerable duration.
Other information
Initially, all 8 GPUs are busy, but as time progresses they gradually stop doing any work.
Note that the tasks have not completed; the tasks (containers) terminate automatically before their execution is finished.
This issue arises both in the initial run and in subsequent resumed runs (by which point the GPUs are already idle after the first run).
Additionally, does 'commonsenseqa_ppl' need to be executed in a specific order? It fails every time I run it, and only succeeds when I run it again.
The error message looks like this:
Traceback (most recent call last):
File "-/opencompass/opencompass/tasks/openicl_infer.py", line 148, in <module>
inferencer.run()
File "-/opencompass/opencompass/tasks/openicl_infer.py", line 78, in run
self._inference()
File "-opencompass/opencompass/tasks/openicl_infer.py", line 96, in _inference
retriever = ICL_RETRIEVERS.build(retriever_cfg)
File "-/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
return self.build_func(cfg, *args, **kwargs, registry=self)
File "-/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
obj = obj_cls(**args) # type: ignore
File "-/opencompass/opencompass/openicl/icl_retriever/icl_mdl_retriever.py", line 73, in __init__
super().__init__(dataset, ice_separator, ice_eos_token, ice_num,
File "/-/opencompass/opencompass/openicl/icl_retriever/icl_topk_retriever.py", line 78, in __init__
self.model = SentenceTransformer(sentence_transformers_model_name)
File "-lib/python3.10/site-packages/sentence_transformers/SentenceTransformer.py", line 87, in __init__
snapshot_download(model_name_or_path,
File "-/python3.10/site-packages/sentence_transformers/util.py", line 491, in snapshot_download
path = cached_download(**cached_download_args)
File "-/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
return fn(*args, **kwargs)
File "-/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 757, in cached_download
raise LocalEntryNotFoundError(
huggingface_hub.utils._errors.LocalEntryNotFoundError: Connection error, and we cannot find the requested files in the cached path. Please try again or make sure your Internet connection is on.
[2023-11-15 16:32:58,769] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 68088) of binary: -/bin/python
Traceback (most recent call last):
File "-bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==2.1.0', 'console_scripts', 'torchrun')())
File "-/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "-/python3.10/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "-b/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "-/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "-/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
-opencompass/opencompass/tasks/openicl_infer.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-11-15_16:32:58
host : t-20231115174409-7gr47-worker-0.t-20231115174409-7gr47-worker.mlplatform-customtask.svc.cluster.local
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 68088)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
However, rerunning the script once all tasks have completed seems to allow normal execution.
In summary, I have found that failures in certain datasets (tasks) can be resolved simply by rerunning, without any code modifications. There seems to be some issue with this.
Thanks.
- The error message indicates an internet connection issue (a pre-download workaround is sketched after the configuration below):
huggingface_hub.utils._errors.LocalEntryNotFoundError: Connection error, and we cannot find the requested files in the cached path. Please try again or make sure your Internet connection is on.
- For CommonsenseQA, the default configuration uses MDLRetriever, which requires an additional model for clustering and retrieval. This is a known issue (#550), and we plan to address it shortly. In the meantime, you can use the following dataset configuration, which employs FixKRetriever instead:
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import FixKRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.openicl.icl_evaluator import AccEvaluator
from opencompass.datasets import commonsenseqaDataset
from opencompass.utils.text_postprocessors import first_capital_postprocess

commonsenseqa_reader_cfg = dict(
    input_columns=["question", "A", "B", "C", "D", "E"],
    output_column="answerKey",
    test_split="validation")

_ice_template = dict(
    type=PromptTemplate,
    template=dict(
        begin="</E>",
        round=[
            dict(
                role="HUMAN",
                prompt="{question}\nA. {A}\nB. {B}\nC. {C}\nD. {D}\nE. {E}\nAnswer:",
            ),
            dict(
                role="BOT",
                prompt="{answerKey}",
            ),
        ],
    ),
    ice_token="</E>",
)

commonsenseqa_infer_cfg = dict(
    ice_template=_ice_template,
    retriever=dict(type=FixKRetriever, fix_id_list=[0, 1, 2, 3, 4, 5, 6, 7, 8]),
    inferencer=dict(type=GenInferencer),
)

commonsenseqa_eval_cfg = dict(
    evaluator=dict(type=AccEvaluator),
    pred_postprocessor=dict(type=first_capital_postprocess),
)

commonsenseqa_datasets = [
    dict(
        type=commonsenseqaDataset,
        path="commonsense_qa",
        reader_cfg=commonsenseqa_reader_cfg,
        infer_cfg=commonsenseqa_infer_cfg,
        eval_cfg=commonsenseqa_eval_cfg,
    )
]

del _ice_template
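If you prefer to keep the default MDLRetriever, another workaround for the connection error is to pre-download the sentence-transformers embedding model on a machine with internet access and copy the resulting Hugging Face cache to the offline workers. A minimal sketch follows; the model name 'all-mpnet-base-v2' is an assumption about the retriever's default, so verify it against icl_topk_retriever.py.

# Run on a machine with internet access to populate the local cache
# (e.g. under HF_HOME), then copy that cache to the offline machines.
# NOTE: 'all-mpnet-base-v2' is assumed to be the default embedding model
# used by TopkRetriever/MDLRetriever; confirm in icl_topk_retriever.py.
from sentence_transformers import SentenceTransformer

SentenceTransformer("all-mpnet-base-v2")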
Thanks, but what explains "Initially, all 8 GPUs are operational, but as time progresses, they gradually cease to function"? This hinders automated execution of the program and leads to a significant waste of manpower on manual inspection. Is this also due to network issues? However, when errors caused by network issues occur, it seems that OpenCompass simply skips the affected task.
In OpenCompass, several model and dataset pairs are wrapped into a task, and tasks are then executed sequentially as the smallest unit of work.
I see that you are running OpenCompass locally, in which case the LocalRunner is likely being used. This runner divides 8 GPUs into either 4 or 8 groups (depending on how many GPUs your model requires) and then assigns one task per group for execution. Once a task is completed, the next task is picked up, and so on. When there are no remaining tasks to be executed, the GPUs will enter a stopped state. The code for LocalRunner can be found here: https://github.com/open-compass/opencompass/blob/main/opencompass/runners/local.py.
I'm not entirely sure if this is the situation you are encountering. If you wish to alleviate this issue, you could modify the method of task partitioning and combination. Documentation for this can be found here: https://opencompass.readthedocs.io/en/latest/get_started/faq.html#efficiency, and the relevant code here: https://github.com/open-compass/opencompass/blob/main/opencompass/partitioners/size.py.
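For reference, here is a minimal sketch of what the inference-stage partitioner/runner settings could look like in a config; the parameter values are purely illustrative, not recommendations, and follow the usual OpenCompass config pattern described in the FAQ above.

# Sketch: tune how tasks are partitioned and scheduled for inference.
# A smaller max_task_size produces more, smaller tasks, which keeps the
# GPU groups busier at the cost of more task start-up overhead.
from opencompass.partitioners import SizePartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLInferTask

infer = dict(
    partitioner=dict(type=SizePartitioner, max_task_size=2000),  # illustrative value
    runner=dict(
        type=LocalRunner,
        max_num_workers=8,  # at most 8 concurrent tasks (one per GPU group)
        task=dict(type=OpenICLInferTask),
    ),
)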
Yes, I understand that the GPUs go idle when there are no tasks left. However, as mentioned earlier, the tasks are far from complete; in fact, there are still several hundred tasks pending.
There also seems to be some issue with task allocation (I'm not entirely sure it's a problem, but at times one GPU is assigned many tasks while others only get one; this may not be crucial), for example:
launch OpenICLInfer[llama2_13b/commonsense_qa] on GPU 0
launch OpenICLInfer[llama2_13b/siqa,llama2_13b/race-middle,llama2_13b/tyidqa-goldp_korean,llama2_13b/LEval_tpo,llama2_13b/LEval_tpo] on GPU 1
What is important is that multiple GPUs stop working one after another, even though the tasks are far from complete. The screenshot from the Volcano Engine above illustrates this point (volcengine automatically deletes containers once task execution completes, so there are no timeline records after that).
In OpenCompass, the transition from one task to the next uses separate processes: the process for the previous task exits first, and then a new process is started for the next task. Therefore, over the lifecycle of an OpenCompass run, each GPU should execute multiple tasks, i.e. multiple processes.
It's unclear whether the issue you're encountering is that the process for the previous task did not exit correctly, or that the process for the next task did not start correctly. I am also unsure how the "processes" in OpenCompass relate to the "containers" in the Volcano Engine, and whether there could be a monitoring issue.
1. It doesn't seem to be a monitoring issue, because the command-line tool nvidia-smi also shows that the GPUs are not doing any work.
2. Volcano Engine containers are perhaps similar to Docker? Each task starts a container, which is automatically reclaimed after completion.
3. How should I monitor whether these processes are correctly terminated/started? This issue has significantly impacted the automation of my task execution.
You may take this as a signal that the previous process terminated properly:
https://github.com/open-compass/opencompass/blob/main/opencompass/runners/local.py#L135-L136
Cancel the redirect here, and you may take this as a signal that the next process started successfully:
https://github.com/open-compass/opencompass/blob/main/opencompass/tasks/openicl_infer.py#L58
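If you would rather not watch nvidia-smi by hand, a rough external watcher along the following lines can log both signals over time. This is not part of OpenCompass; psutil is an extra dependency, and matching task processes by 'openicl_infer' in the command line is an assumption you may need to adjust.

# Rough watcher: once a minute, log GPU utilization and the PIDs of any
# running OpenCompass inference tasks, so a stall (idle GPUs while the
# run is unfinished) is visible in a single log line.
import subprocess
import time

import psutil


def gpu_utilizations():
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return [int(x) for x in out.stdout.split()]


def task_pids():
    pids = []
    for p in psutil.process_iter(["pid", "cmdline"]):
        cmdline = " ".join(p.info["cmdline"] or [])
        if "openicl_infer" in cmdline:
            pids.append(p.info["pid"])
    return pids


while True:
    print(time.strftime("%H:%M:%S"),
          "gpu util:", gpu_utilizations(),
          "task pids:", task_pids())
    time.sleep(60)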
It seems that there is an issue with task initiation. For example, this task printed 'launch OpenICLInfer' on the command line, but no new task actually appears to have started. In the 'infer' folder, the previous file has a termination marker, but no new task files were added.
In the end, all GPUs are affected until the issue is detected manually and the run is restarted.
After restarting, there are still tasks to run. The numbers also match up, which corroborates the task-initiation issue: 161 - 24 - 8 = 129, i.e. (total tasks) - (already finished) - (current tasks, which are already finished; the tqdm counter only updates when a new task is initialized).
Similar to the situation below: a task set started at 10 o'clock, and from 10 to 11 o'clock 4 GPUs had a utilization rate of 0, while the remaining 4 GPUs continued running until 2:30. The logs indicate that during this period tasks were allocated only to those 4 GPUs, while the other 4 sat completely idle and were wasted.
@Leymore @tonysy Can you take a look to determine the root cause? This issue is significantly impacting the automation of tasks.
Thanks. Could you provide an example config? We can try it to reproduce this issue.
The bash script is as follows. It seems to be dataset-agnostic:
export HF_HOME=/xxxx/hf_cache
export TRANSFORMERS_OFFLINE=1
export HF_DATASETS_OFFLINE=1
python run.py --models hf_llama2_13b --summarizer leaderboard -r -l -w ./outputs/main --datasets \
SuperGLUE_COPA_ppl piqa_ppl siqa_ppl strategyqa_gen bbh_gen mmlu_ppl SuperGLUE_MultiRC_ppl race_ppl drop_gen \
obqa_ppl squad20_gen Xsum_gen lambada_gen leval longbench civilcomments_clp tydiqa_gen XLSum_gen \
SuperGLUE_BoolQ_ppl commonsenseqa_ppl SuperGLUE_AX_b_ppl SuperGLUE_AX_g_ppl storycloze_ppl triviaqa_gen
In addition, I changed the model paths in the configs to local paths instead of remote Hugging Face paths.