[Bug] Using --max-num-worker causes the SSH connection to drop
Prerequisite
- [x] I have searched Issues and Discussions but cannot get the expected help.
- [x] The bug has not been fixed in the latest version.
Type
I'm evaluating with the officially supported tasks/models/datasets.
Environment
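I need to evaluate my model in parallel because evaluating a single example takes too long. Following the code in #1755, I parallelized the run with CUDA_VISIBLE_DEVICES=6,7 opencompass --models vllm_qwen2_5_0_5b_instruct --datasets triviaqa_gen -a vllm --max-num-worker 2. However, every time I launch this command, my SSH connection drops immediately and returns the error shell request failed on channel 0. I suspect the problem is related to the OpenCompass code.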
Reproduces the problem - code/configuration sample
CUDA_VISIBLE_DEVICES=6,7 opencompass --models vllm_qwen2_5_0_5b_instruct --datasets triviaqa_gen -a vllm --max-num-worker 2
Reproduces the problem - command or script
CUDA_VISIBLE_DEVICES=6,7 opencompass --models vllm_qwen2_5_0_5b_instruct --datasets triviaqa_gen -a vllm --max-num-worker 2
Reproduces the problem - error message
shell request failed on channel 0
Other information
Can you launch the evaluation with --debug?
Yes. When I run with the --debug option, the SSH connection stays up. However, it only uses one GPU instead of the number I set in CUDA_VISIBLE_DEVICES and --max-num-worker.
To be honest, I'm not sure what causes this bug. Could you run the command inside tmux and try again without --debug?
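For reference, the tmux suggestion above could be sketched as follows. This is only a minimal sketch: the session name oc_eval and the log file oc_eval.log are hypothetical placeholders, and the evaluation command is copied from the report. Running inside tmux keeps the job alive and captures its output if the SSH session drops; whether it prevents the disconnect itself is not confirmed.

```shell
#!/usr/bin/env bash
# Hypothetical names chosen for illustration; adjust as needed.
SESSION="oc_eval"
LOG="oc_eval.log"

# Evaluation command from the report, without --debug.
EVAL_CMD="CUDA_VISIBLE_DEVICES=6,7 opencompass --models vllm_qwen2_5_0_5b_instruct --datasets triviaqa_gen -a vllm --max-num-worker 2"

start_eval() {
  # Start a detached tmux session so the run does not depend on the SSH shell,
  # and copy stdout/stderr to a log file for later inspection.
  tmux new-session -d -s "$SESSION" "$EVAL_CMD 2>&1 | tee $LOG"
}

# Usage (run manually):
#   start_eval
#   tmux attach -t oc_eval    # watch progress; detach with Ctrl-b d
#   tail -f oc_eval.log       # or follow the log after reconnecting
```

If the connection still drops, the log file should show how far the run got before the disconnect, which may help narrow down the cause.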