[Bug] AlignBench cannot be evaluated with a VLLM judge model: the eval stage hangs and then errors out
Prerequisites
Problem type
I'm using the officially supported tasks/models/datasets for evaluation.
Environment
opencompass 0.2.6
Ubuntu 20.04
python 3.10.14
Reproduces the problem - code/configuration sample
Config file:
from mmengine.config import read_base
with read_base():
    from .datasets.subjective.alignbench.alignbench_judgeby_critiquellm import alignbench_datasets

from opencompass.models import HuggingFaceCausalLM, HuggingFace, HuggingFaceChatGLM3, OpenAI
from opencompass.models.openai_api import OpenAIAllesAPIN
from opencompass.partitioners import NaivePartitioner, SizePartitioner
from opencompass.partitioners.sub_naive import SubjectiveNaivePartitioner
from opencompass.partitioners.sub_size import SubjectiveSizePartitioner
from opencompass.runners import LocalRunner
from opencompass.runners import SlurmSequentialRunner
from opencompass.tasks import OpenICLInferTask
from opencompass.tasks.subjective_eval import SubjectiveEvalTask
from opencompass.summarizers import AlignmentBenchSummarizer

# -------------Inference Stage ----------------------------------------
# For subjective evaluation, we often set do sample for models
from opencompass.models import VLLM

_meta_template = dict(
    round=[
        dict(role="HUMAN", begin='<|im_start|>user\n', end='<|im_end|>\n'),
        dict(role="BOT", begin="<|im_start|>assistant\n", end='<|im_end|>\n', generate=True),
    ],
    eos_token_id=151645,
)

GPU_NUMS = 4
stop_list = ['<|im_end|>', '</s>', '<|endoftext|>']

models = [
    dict(
        type=VLLM,
        abbr='xxx',
        path='xxx',
        model_kwargs=dict(tensor_parallel_size=GPU_NUMS, disable_custom_all_reduce=True, enforce_eager=True),
        meta_template=_meta_template,
        max_out_len=1024,
        max_seq_len=2048,
        batch_size=GPU_NUMS * 8,
        generation_kwargs=dict(temperature=0.1, top_p=0.9, skip_special_tokens=False, stop=stop_list),
        stop_words=stop_list,
        run_cfg=dict(num_gpus=GPU_NUMS, num_procs=1),
    )
]

datasets = [*alignbench_datasets]
# -------------Evaluation Stage ----------------------------------------
## ------------- JudgeLLM Configuration
api_meta_template = dict(
    round=[
        dict(role='HUMAN', api_role='HUMAN'),
        dict(role='BOT', api_role='BOT', generate=True),
    ],
)

judge_models = [
    dict(
        type=VLLM,
        abbr='CritiqueLLM',
        path='/xxx/models/CritiqueLLM',
        model_kwargs=dict(tensor_parallel_size=GPU_NUMS, disable_custom_all_reduce=True, enforce_eager=True),
        meta_template=_meta_template,
        max_out_len=1024,
        max_seq_len=2048,
        batch_size=GPU_NUMS * 8,
        generation_kwargs=dict(temperature=0.1, top_p=0.9, skip_special_tokens=False, stop=stop_list),
        run_cfg=dict(num_gpus=GPU_NUMS, num_procs=1),
    )
]

## ------------- Evaluation Configuration
eval = dict(
    partitioner=dict(type=SubjectiveNaivePartitioner, models=models, judge_models=judge_models),
    runner=dict(type=LocalRunner, max_num_workers=16, task=dict(type=SubjectiveEvalTask)),
)

summarizer = dict(type=AlignmentBenchSummarizer)

work_dir = 'outputs/alignment_bench/'
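As a sanity check (not part of the OpenCompass run), here is a minimal standalone sketch for loading the same judge checkpoint directly with vLLM and the same tensor-parallel settings; the model path is the placeholder from the config above, and the prompt and sampling values are only illustrative. If this loads and generates on 4 GPUs, the hang would seem specific to how SubjectiveEvalTask builds the judge model.

# standalone_judge_check.py - hypothetical helper script, not an official OpenCompass file
from vllm import LLM, SamplingParams

# same placeholder path and tensor-parallel settings as judge_models above
llm = LLM(
    model='/xxx/models/CritiqueLLM',
    tensor_parallel_size=4,
    disable_custom_all_reduce=True,
    enforce_eager=True,
    trust_remote_code=True,
)
# illustrative prompt, just to confirm the engine initializes and can generate
print(llm.generate(['你好'], SamplingParams(temperature=0.1, top_p=0.9, max_tokens=64)))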
Reproduces the problem - command or script
python run.py configs/eval_xxx.py --debug --dump-eval-details
Reproduces the problem - error message
The first run failed with this error. For the second attempt I reran with -m eval -r xxx to reuse the previous prediction results and run only the eval stage, but it still fails with the error below.
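The second command was roughly of the form below (the resume timestamp corresponds to the exp folder shown in the log; the config name is elided as above):

python run.py configs/eval_xxx.py -m eval -r 20240708_211011 --debug --dump-eval-details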
07/08 21:37:23 - OpenCompass - INFO - Reusing experiements from 20240708_211011
07/08 21:37:23 - OpenCompass - INFO - Current exp folder: outputs/alignment_bench/20240708_211011
07/08 21:37:23 - OpenCompass - DEBUG - Modules of opencompass's partitioner registry have been automatically imported from opencompass.partitioners
07/08 21:37:23 - OpenCompass - DEBUG - Get class `SubjectiveNaivePartitioner` from "partitioner" registry in "opencompass"
07/08 21:37:23 - OpenCompass - DEBUG - An `SubjectiveNaivePartitioner` instance is built from registry, and its implementation can be found in opencompass.partitioners.sub_naive
07/08 21:37:23 - OpenCompass - DEBUG - Key eval.runner.task.judge_cfg not found in config, ignored.
07/08 21:37:23 - OpenCompass - DEBUG - Key eval.given_pred not found in config, ignored.
07/08 21:37:23 - OpenCompass - DEBUG - Additional config: {'eval': {'runner': {'task': {'dump_details': True}}}}
07/08 21:37:23 - OpenCompass - INFO - Partitioned into 1 tasks.
07/08 21:37:23 - OpenCompass - DEBUG - Task 0: [firefly_qw14b_chat_self_build_rl_dpo_full_b06_240705/alignment_bench]
07/08 21:37:23 - OpenCompass - DEBUG - Modules of opencompass's runner registry have been automatically imported from opencompass.runners
07/08 21:37:23 - OpenCompass - DEBUG - Get class `LocalRunner` from "runner" registry in "opencompass"
07/08 21:37:23 - OpenCompass - DEBUG - An `LocalRunner` instance is built from registry, and its implementation can be found in opencompass.runners.local
07/08 21:37:23 - OpenCompass - DEBUG - Modules of opencompass's task registry have been automatically imported from opencompass.tasks
07/08 21:37:23 - OpenCompass - DEBUG - Get class `SubjectiveEvalTask` from "task" registry in "opencompass"
07/08 21:37:23 - OpenCompass - DEBUG - An `SubjectiveEvalTask` instance is built from registry, and its implementation can be found in opencompass.tasks.subjective_eval
Error: mkl-service + Intel(R) MKL: MKL_THREADING_LAYER=INTEL is incompatible with libgomp-a34b3233.so.1 library.
Try to import numpy first or set the threading layer accordingly. Set MKL_SERVICE_FORCE_INTEL to force it.
07/08 21:37:51 - OpenCompass - INFO - No postprocessor found.
2024-07-08 21:37:55,725 INFO worker.py:1743 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
INFO 07-08 21:37:59 llm_engine.py:100] Initializing an LLM engine (v0.4.2) with config: model='/maindata/data/shared/Security-SFT/cmz/models/CritiqueLLM', speculative_config=None, tokenizer='/maindata/data/shared/Security-SFT/cmz/models/CritiqueLLM', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/maindata/data/shared/Security-SFT/cmz/models/CritiqueLLM)
WARNING 07-08 21:38:00 tokenizer.py:126] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
(pid=2330) Error: mkl-service + Intel(R) MKL: MKL_THREADING_LAYER=INTEL is incompatible with libgomp-a34b3233.so.1 library.
(pid=2330) Try to import numpy first or set the threading layer accordingly. Set MKL_SERVICE_FORCE_INTEL to force it.
(pid=3478) Error: mkl-service + Intel(R) MKL: MKL_THREADING_LAYER=INTEL is incompatible with libgomp-a34b3233.so.1 library.
(pid=3478) Try to import numpy first or set the threading layer accordingly. Set MKL_SERVICE_FORCE_INTEL to force it.
(pid=3565) Error: mkl-service + Intel(R) MKL: MKL_THREADING_LAYER=INTEL is incompatible with libgomp-a34b3233.so.1 library.
(pid=3565) Try to import numpy first or set the threading layer accordingly. Set MKL_SERVICE_FORCE_INTEL to force it.
(pid=3652) Error: mkl-service + Intel(R) MKL: MKL_THREADING_LAYER=INTEL is incompatible with libgomp-a34b3233.so.1 library.
(pid=3652) Try to import numpy first or set the threading layer accordingly. Set MKL_SERVICE_FORCE_INTEL to force it.
INFO 07-08 21:38:30 utils.py:660] Found nccl from library /usr/lib/x86_64-linux-gnu/libnccl.so.2
(RayWorkerWrapper pid=3478) INFO 07-08 21:38:30 utils.py:660] Found nccl from library /usr/lib/x86_64-linux-gnu/libnccl.so.2
INFO 07-08 21:38:30 selector.py:27] Using FlashAttention-2 backend.
(RayWorkerWrapper pid=3478) INFO 07-08 21:38:36 selector.py:27] Using FlashAttention-2 backend.
(RayWorkerWrapper pid=3652) INFO 07-08 21:38:30 utils.py:660] Found nccl from library /usr/lib/x86_64-linux-gnu/libnccl.so.2 [repeated 2x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
[It hangs here for a very long time, then the following error is raised]
[E socket.cpp:957] [c10d] The client socket has timed out after 600s while trying to connect to (10.0.11.17, 44169).
ERROR 07-08 21:48:35 worker_base.py:145] Error executing method init_device. This might cause deadlock in distributed execution.
ERROR 07-08 21:48:35 worker_base.py:145] Traceback (most recent call last):
ERROR 07-08 21:48:35 worker_base.py:145] File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 137, in execute_method
ERROR 07-08 21:48:35 worker_base.py:145] return executor(*args, **kwargs)
ERROR 07-08 21:48:35 worker_base.py:145] File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/vllm/worker/worker.py", line 111, in init_device
ERROR 07-08 21:48:35 worker_base.py:145] init_worker_distributed_environment(self.parallel_config, self.rank,
ERROR 07-08 21:48:35 worker_base.py:145] File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/vllm/worker/worker.py", line 288, in init_worker_distributed_environment
ERROR 07-08 21:48:35 worker_base.py:145] init_distributed_environment(parallel_config.world_size, rank,
ERROR 07-08 21:48:35 worker_base.py:145] File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 70, in init_distributed_environment
ERROR 07-08 21:48:35 worker_base.py:145] torch.distributed.init_process_group(
ERROR 07-08 21:48:35 worker_base.py:145] File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
ERROR 07-08 21:48:35 worker_base.py:145] return func(*args, **kwargs)
ERROR 07-08 21:48:35 worker_base.py:145] File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 89, in wrapper
ERROR 07-08 21:48:35 worker_base.py:145] func_return = func(*args, **kwargs)
ERROR 07-08 21:48:35 worker_base.py:145] File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1305, in init_process_group
ERROR 07-08 21:48:35 worker_base.py:145] store, rank, world_size = next(rendezvous_iterator)
ERROR 07-08 21:48:35 worker_base.py:145] File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 199, in _tcp_rendezvous_handler
ERROR 07-08 21:48:35 worker_base.py:145] store = _create_c10d_store(result.hostname, result.port, rank, world_size, timeout, use_libuv)
ERROR 07-08 21:48:35 worker_base.py:145] File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 170, in _create_c10d_store
ERROR 07-08 21:48:35 worker_base.py:145] tcp_store = TCPStore(hostname, port, world_size, False, timeout)
ERROR 07-08 21:48:35 worker_base.py:145] torch.distributed.DistNetworkError: The client socket has timed out after 600s while trying to connect to (10.0.11.17, 44169).
Traceback (most recent call last):
(RayWorkerWrapper pid=3478) ERROR 07-08 21:48:35 worker_base.py:145] Error executing method init_device. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=3478) ERROR 07-08 21:48:35 worker_base.py:145] Traceback (most recent call last):
File "/maindata/data/shared/Security-SFT/cmz/opencompass/opencompass/tasks/subjective_eval.py", line 450, in <module>
(RayWorkerWrapper pid=3478) ERROR 07-08 21:48:35 worker_base.py:145] File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 137, in execute_method
(RayWorkerWrapper pid=3478) ERROR 07-08 21:48:35 worker_base.py:145] return executor(*args, **kwargs)
(RayWorkerWrapper pid=3478) ERROR 07-08 21:48:35 worker_base.py:145] File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/vllm/worker/worker.py", line 111, in init_device
(RayWorkerWrapper pid=3478) ERROR 07-08 21:48:35 worker_base.py:145] init_worker_distributed_environment(self.parallel_config, self.rank,
(RayWorkerWrapper pid=3478) ERROR 07-08 21:48:35 worker_base.py:145] File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/vllm/worker/worker.py", line 288, in init_worker_distributed_environment
(RayWorkerWrapper pid=3478) ERROR 07-08 21:48:35 worker_base.py:145] init_distributed_environment(parallel_config.world_size, rank,
(RayWorkerWrapper pid=3478) ERROR 07-08 21:48:35 worker_base.py:145] File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 70, in init_distributed_environment
(RayWorkerWrapper pid=3478) ERROR 07-08 21:48:35 worker_base.py:145] torch.distributed.init_process_group(
(RayWorkerWrapper pid=3478) ERROR 07-08 21:48:35 worker_base.py:145] File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
(RayWorkerWrapper pid=3478) ERROR 07-08 21:48:35 worker_base.py:145] return func(*args, **kwargs)
(RayWorkerWrapper pid=3478) ERROR 07-08 21:48:35 worker_base.py:145] File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 89, in wrapper
(RayWorkerWrapper pid=3478) ERROR 07-08 21:48:35 worker_base.py:145] func_return = func(*args, **kwargs)
(RayWorkerWrapper pid=3478) ERROR 07-08 21:48:35 worker_base.py:145] File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1305, in init_process_group
(RayWorkerWrapper pid=3478) ERROR 07-08 21:48:35 worker_base.py:145] store, rank, world_size = next(rendezvous_iterator)
(RayWorkerWrapper pid=3478) ERROR 07-08 21:48:35 worker_base.py:145] File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 199, in _tcp_rendezvous_handler
(RayWorkerWrapper pid=3478) ERROR 07-08 21:48:35 worker_base.py:145] store = _create_c10d_store(result.hostname, result.port, rank, world_size, timeout, use_libuv)
(RayWorkerWrapper pid=3478) ERROR 07-08 21:48:35 worker_base.py:145] File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 170, in _create_c10d_store
(RayWorkerWrapper pid=3478) ERROR 07-08 21:48:35 worker_base.py:145] tcp_store = TCPStore(hostname, port, world_size, False, timeout)
(RayWorkerWrapper pid=3478) ERROR 07-08 21:48:35 worker_base.py:145] torch.distributed.DistNetworkError: The client socket has timed out after 600s while trying to connect to (10.0.11.17, 44169).
(RayWorkerWrapper pid=3652) INFO 07-08 21:38:36 selector.py:27] Using FlashAttention-2 backend. [repeated 2x across cluster]
(RayWorkerWrapper pid=3478) [E socket.cpp:957] [c10d] The client socket has timed out after 600s while trying to connect to (10.0.11.17, 44169).
inferencer.run()
File "/maindata/data/shared/Security-SFT/cmz/opencompass/opencompass/tasks/subjective_eval.py", line 94, in run
self._score(model_cfg, dataset_cfg, eval_cfg, output_column,
File "/maindata/data/shared/Security-SFT/cmz/opencompass/opencompass/tasks/subjective_eval.py", line 379, in _score
icl_evaluator = ICL_EVALUATORS.build(eval_cfg['evaluator'])
File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
return self.build_func(cfg, *args, **kwargs, registry=self)
File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
obj = obj_cls(**args) # type: ignore
File "/maindata/data/shared/Security-SFT/cmz/opencompass/opencompass/openicl/icl_evaluator/lm_evaluator.py", line 109, in __init__
model = build_model_from_cfg(model_cfg=judge_cfg)
File "/maindata/data/shared/Security-SFT/cmz/opencompass/opencompass/utils/build.py", line 25, in build_model_from_cfg
return MODELS.build(model_cfg)
File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
return self.build_func(cfg, *args, **kwargs, registry=self)
File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
obj = obj_cls(**args) # type: ignore
File "/maindata/data/shared/Security-SFT/cmz/opencompass/opencompass/models/vllm.py", line 37, in __init__
self._load_model(path, model_kwargs)
File "/maindata/data/shared/Security-SFT/cmz/opencompass/opencompass/models/vllm.py", line 60, in _load_model
self.model = LLM(path, **model_kwargs)
File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 123, in __init__
self.llm_engine = LLMEngine.from_engine_args(
File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 292, in from_engine_args
engine = cls(
File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 160, in __init__
self.model_executor = executor_class(
File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 41, in __init__
self._init_executor()
File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 43, in _init_executor
self._init_workers_ray(placement_group)
File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 164, in _init_workers_ray
self._run_workers("init_device")
File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 234, in _run_workers
driver_worker_output = self.driver_worker.execute_method(
File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 146, in execute_method
raise e
File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 137, in execute_method
return executor(*args, **kwargs)
File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/vllm/worker/worker.py", line 111, in init_device
init_worker_distributed_environment(self.parallel_config, self.rank,
File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/vllm/worker/worker.py", line 288, in init_worker_distributed_environment
init_distributed_environment(parallel_config.world_size, rank,
File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 70, in init_distributed_environment
torch.distributed.init_process_group(
File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
return func(*args, **kwargs)
File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 89, in wrapper
func_return = func(*args, **kwargs)
File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1305, in init_process_group
store, rank, world_size = next(rendezvous_iterator)
File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 199, in _tcp_rendezvous_handler
store = _create_c10d_store(result.hostname, result.port, rank, world_size, timeout, use_libuv)
File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 170, in _create_c10d_store
tcp_store = TCPStore(hostname, port, world_size, False, timeout)
torch.distributed.DistNetworkError: The client socket has timed out after 600s while trying to connect to (10.0.11.17, 44169).
(RayWorkerWrapper pid=3652) ERROR 07-08 21:48:35 worker_base.py:145] Error executing method init_device. This might cause deadlock in distributed execution. [repeated 2x across cluster]
(RayWorkerWrapper pid=3652) ERROR 07-08 21:48:35 worker_base.py:145] Traceback (most recent call last): [repeated 2x across cluster]
(RayWorkerWrapper pid=3652) ERROR 07-08 21:48:35 worker_base.py:145] File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 137, in execute_method [repeated 2x across cluster]
(RayWorkerWrapper pid=3652) ERROR 07-08 21:48:35 worker_base.py:145] return executor(*args, **kwargs) [repeated 2x across cluster]
(RayWorkerWrapper pid=3652) ERROR 07-08 21:48:35 worker_base.py:145] File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/vllm/worker/worker.py", line 111, in init_device [repeated 2x across cluster]
(RayWorkerWrapper pid=3652) ERROR 07-08 21:48:35 worker_base.py:145] init_worker_distributed_environment(self.parallel_config, self.rank, [repeated 2x across cluster]
(RayWorkerWrapper pid=3652) ERROR 07-08 21:48:35 worker_base.py:145] File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/vllm/worker/worker.py", line 288, in init_worker_distributed_environment [repeated 2x across cluster]
(RayWorkerWrapper pid=3652) ERROR 07-08 21:48:35 worker_base.py:145] init_distributed_environment(parallel_config.world_size, rank, [repeated 2x across cluster]
(RayWorkerWrapper pid=3652) ERROR 07-08 21:48:35 worker_base.py:145] File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 70, in init_distributed_environment [repeated 2x across cluster]
(RayWorkerWrapper pid=3652) ERROR 07-08 21:48:35 worker_base.py:145] torch.distributed.init_process_group( [repeated 2x across cluster]
(RayWorkerWrapper pid=3652) ERROR 07-08 21:48:35 worker_base.py:145] File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 89, in wrapper [repeated 4x across cluster]
(RayWorkerWrapper pid=3652) ERROR 07-08 21:48:35 worker_base.py:145] return func(*args, **kwargs) [repeated 2x across cluster]
(RayWorkerWrapper pid=3652) ERROR 07-08 21:48:35 worker_base.py:145] func_return = func(*args, **kwargs) [repeated 2x across cluster]
(RayWorkerWrapper pid=3652) ERROR 07-08 21:48:35 worker_base.py:145] File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1305, in init_process_group [repeated 2x across cluster]
(RayWorkerWrapper pid=3652) ERROR 07-08 21:48:35 worker_base.py:145] store, rank, world_size = next(rendezvous_iterator) [repeated 2x across cluster]
(RayWorkerWrapper pid=3652) ERROR 07-08 21:48:35 worker_base.py:145] File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 199, in _tcp_rendezvous_handler [repeated 2x across cluster]
(RayWorkerWrapper pid=3652) ERROR 07-08 21:48:35 worker_base.py:145] store = _create_c10d_store(result.hostname, result.port, rank, world_size, timeout, use_libuv) [repeated 2x across cluster]
(RayWorkerWrapper pid=3652) ERROR 07-08 21:48:35 worker_base.py:145] File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 170, in _create_c10d_store [repeated 2x across cluster]
(RayWorkerWrapper pid=3652) ERROR 07-08 21:48:35 worker_base.py:145] tcp_store = TCPStore(hostname, port, world_size, False, timeout) [repeated 2x across cluster]
(RayWorkerWrapper pid=3652) ERROR 07-08 21:48:35 worker_base.py:145] torch.distributed.DistNetworkError: The client socket has timed out after 600s while trying to connect to (10.0.11.17, 44169). [repeated 2x across cluster]
(RayWorkerWrapper pid=3652) [E socket.cpp:957] [c10d] The client socket has timed out after 600s while trying to connect to (10.0.11.17, 44169). [repeated 2x across cluster]
E0708 21:48:40.958000 140381132564288 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 115) of binary: /maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/bin/python
Traceback (most recent call last):
File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
run(args)
File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
elastic_launch(
File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/opencompass/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/maindata/data/shared/Security-SFT/cmz/opencompass/opencompass/tasks/subjective_eval.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-07-08_21:48:40
host : eflops16
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 115)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
07/08 21:48:41 - OpenCompass - DEBUG - Get class `AlignmentBenchSummarizer` from "partitioner" registry in "opencompass"
07/08 21:48:41 - OpenCompass - DEBUG - An `AlignmentBenchSummarizer` instance is built from registry, and its implementation can be found in opencompass.summarizers.subjective.alignmentbench
outputs/alignment_bench/20240708_211011/results/firefly_qw14b_chat_self_build_rl_dpo_full_b06_240705_judged-by--CritiqueLLM is not exist! please check!
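Side note: the mkl-service warning near the top of the log ("MKL_THREADING_LAYER=INTEL is incompatible with libgomp...") suggests its own workaround, namely setting the environment variable it mentions before launching; whether it is related to the c10d socket timeout is unclear. A sketch of that workaround:

MKL_SERVICE_FORCE_INTEL=1 python run.py configs/eval_xxx.py -m eval -r 20240708_211011 --debug --dump-eval-details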
Other information
No response