
Process hangs when using `tensor_parallel_size` and `data_parallel_size` together

Open harshakokel opened this issue 1 year ago • 9 comments

Hello,

I noticed that my process hangs at `results = ray.get(object_refs)` when I use `data_parallel_size` as well as `tensor_parallel_size` for vllm models.

For example, this call would hang.

lm_eval --model vllm --model_args pretrained=gpt2,data_parallel_size=2,tensor_parallel_size=2 --tasks arc_easy --output ./trial/  --log_samples --limit 10

These would not.

lm_eval  --model vllm --model_args pretrained=gpt2,data_parallel_size=1,tensor_parallel_size=2 --tasks arc_easy --output ./trial/  --log_samples --limit 10
lm_eval  --model vllm --model_args pretrained=gpt2,data_parallel_size=2,tensor_parallel_size=1 --tasks arc_easy --output ./trial/  --log_samples --limit 10
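
For context, this is roughly the data-parallel path that ends in the `ray.get(object_refs)` call mentioned above. The names below are illustrative rather than a copy of the harness's code: each DP replica is a separate Ray task that builds its own vLLM engine with `tensor_parallel_size` GPUs, so with DP=2 and TP=2 the run needs 4 GPUs in total, and the parent process blocks on `ray.get` until every replica returns.

```python
# Minimal sketch of the DP-over-Ray pattern (illustrative, not the harness's exact code).
import ray
from vllm import LLM, SamplingParams

@ray.remote
def run_inference_one_model(model_args: dict, sampling_params, prompts):
    # Each data-parallel replica constructs its own engine, which internally
    # uses tensor_parallel_size GPUs for that replica.
    llm = LLM(**model_args)
    return llm.generate(prompts, sampling_params=sampling_params)

if __name__ == "__main__":
    ray.init()
    model_args = {"model": "gpt2", "tensor_parallel_size": 2}
    sampling_params = SamplingParams(max_tokens=32)
    prompts = ["Q: What is 2+2? A:", "Q: What color is the sky? A:"]
    shards = [prompts[0::2], prompts[1::2]]  # split requests across data_parallel_size=2 replicas
    object_refs = [
        run_inference_one_model.remote(model_args, sampling_params, shard)
        for shard in shards
    ]
    results = ray.get(object_refs)  # the call where the reported hang occurs
    ray.shutdown()
```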

Does anyone else face a similar problem?

harshakokel avatar Apr 22 '24 21:04 harshakokel

Hi! What version of vLLM are you running with?

@baberabb has observed problems like this before with later versions of vLLM (>v0.3.3, I believe).

haileyschoelkopf avatar Apr 26 '24 15:04 haileyschoelkopf

I am on vllm 0.3.2.

harshakokel avatar Apr 26 '24 16:04 harshakokel

Is this a vllm problem? Should I be raising an issue on that repo?

harshakokel avatar Apr 26 '24 16:04 harshakokel

Hey. Have you tried caching the weights by running with DP=1 until they are downloaded? I found it prone to hang with DP otherwise.
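
If you want to warm the cache without a full DP=1 eval run, something like this should also work (assuming the model is pulled from the Hugging Face Hub; `snapshot_download` comes from the `huggingface_hub` package):

```python
# Download the weights once in a single process so the DP replicas only read
# from the local Hugging Face cache instead of racing on the download.
from huggingface_hub import snapshot_download

snapshot_download("gpt2")
```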

baberabb avatar Apr 26 '24 17:04 baberabb

Yes, the weights are cached. The process is hanging after `llm.generate` returns results.

harshakokel avatar Apr 26 '24 17:04 harshakokel

Hmm. It's working for me with 0.3.2. Have you tried running in a fresh virtual environment?

baberabb avatar Apr 26 '24 18:04 baberabb

Just tried it on a separate server with a new env and I still face the same issue. What version of ray do you have? Mine is `ray==2.10.0`.

harshakokel avatar Apr 26 '24 20:04 harshakokel

Probably the latest one. I installed it with `pip install -e ".[vllm]"` on RunPod with 4 GPUs.
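
For comparing environments, a quick way to print the exact installed versions (assuming both packages import cleanly):

```python
# Print the installed ray and vllm versions for this environment.
import ray
import vllm

print("ray:", ray.__version__, "| vllm:", vllm.__version__)
```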

baberabb avatar Apr 27 '24 12:04 baberabb