
[Bug] Using max-num-worker causes the SSH connection to drop

Open timturing opened this issue 1 year ago • 3 comments

Prerequisite

Type

I'm evaluating with the officially supported tasks/models/datasets.

Environment

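I need to evaluate my model in parallel because evaluating a single example takes too long. Following the code in #1755, I parallelized the run with CUDA_VISIBLE_DEVICES=6,7 opencompass --models vllm_qwen2_5_0_5b_instruct --datasets triviaqa_gen -a vllm --max-num-worker 2. However, every time I launch this command, my SSH session disconnects immediately and reports the error shell request failed on channel 0. I suspect the problem is related to the OpenCompass code.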

Reproduces the problem - command or script

CUDA_VISIBLE_DEVICES=6,7 opencompass --models vllm_qwen2_5_0_5b_instruct --datasets triviaqa_gen -a vllm --max-num-worker 2

Reproduces the problem - error message

shell request failed on channel 0
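One commonly reported cause of shell request failed on channel 0 (an assumption here, not confirmed anywhere in this thread) is that the account has reached its per-user process limit (ulimit -u), so sshd cannot fork a login shell. On Linux that limit counts threads as well as processes, and several vLLM workers may spawn many of both. The sketch below, which assumes Linux with procps ps, compares the current thread count against the limit; run it from a session that still works (for example an existing terminal or tmux pane) while the evaluation is running.

```shell
#!/usr/bin/env bash
# Hypothesis check only: is this user close to the per-user process/thread limit?
# Assumes Linux with procps "ps" (supports -L and --no-headers).

# Max user processes for this shell; may print "unlimited".
limit=$(ulimit -u)

# -L prints one line per thread (LWP). This is an approximation: ps -u selects
# by effective UID, while the kernel limit is charged to the real UID.
in_use=$(ps -u "$(id -un)" -L --no-headers | wc -l | tr -d ' ')

echo "threads in use by $(id -un): ${in_use}"
echo "max user processes (ulimit -u): ${limit}"
```

If the first number is at or near the second while opencompass is running, the disconnect is more likely a resource limit than an OpenCompass bug as such.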

Other information

timturing, Feb 23 '25 04:02

Can you launch the evaluation with --debug?

tonysy, Feb 24 '25 09:02

Yes. With the --debug option the SSH connection stays up. However, only one GPU is then used, instead of the number I set through CUDA_VISIBLE_DEVICES and --max-num-worker.

timturing, Feb 24 '25 10:02

Actually, I'm not sure what causes this bug. Could you run the command inside tmux and try again without --debug?

tonysy, Feb 26 '25 07:02
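A minimal sketch of the tmux suggestion: start the same command in a detached tmux session so it keeps running even if the SSH connection drops, and mirror its output to a log file. The session name opencompass_eval, the log name eval.log, and the helper launch_eval are hypothetical; only the tmux flags (new-session -d -s) and the original command line come from real usage.

```shell
#!/usr/bin/env bash
# Hypothetical names: SESSION, LOG and launch_eval are illustrative only.
SESSION="opencompass_eval"
LOG="eval.log"

# The command from the issue, unchanged.
EVAL_CMD="CUDA_VISIBLE_DEVICES=6,7 opencompass --models vllm_qwen2_5_0_5b_instruct --datasets triviaqa_gen -a vllm --max-num-worker 2"

launch_eval() {
  # -d: create the session without attaching to it; -s: give it a name.
  # The shell command runs inside the session; tee copies output to $LOG.
  tmux new-session -d -s "$SESSION" "$EVAL_CMD 2>&1 | tee \"$LOG\""
}
```

Call launch_eval, then reattach later with tmux attach -t opencompass_eval (detach again with Ctrl-b d) or follow progress with tail -f eval.log. The tmux session closes when the command exits, but the output stays in eval.log.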