[Bug]: `pt_main_thread` processes are not killed after main process is killed in MP distributed executor backend
Your current environment
PyTorch version: 2.3.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: 14.0.0-1ubuntu1.1
CMake version: version 3.30.0
Libc version: glibc-2.35
Python version: 3.10.12 (main, Mar 22 2024, 16:50:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-78-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.5.82
CUDA_MODULE_LOADING set to: LAZY
GPU models:
A100s
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] torch==2.3.1
[pip3] torchvision==0.18.1
[pip3] transformers==4.42.4
[pip3] triton==2.3.1
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.3.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
🐛 Describe the bug
I am trying to understand vLLM's workflow for distributed serving via multiprocessing. The original setup deploys a model with tensor parallel size = 2 through Triton Inference Server with `distributed_executor_backend: mp`. Inference itself works fine, but when the server shuts down, two `pt_main_thread` processes are not killed and remain in `State: S (sleeping)`.
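As a stopgap, one could try to reap the workers explicitly before the main process exits. Below is only a rough sketch that assumes `psutil` is available and that the workers are still direct children of the serving process (neither of which is part of vLLM's own shutdown path); it also only helps on a clean shutdown, not when the parent is SIGKILLed:

```python
import atexit

import psutil  # assumption: psutil is installed; it is not used by vLLM for this


def _kill_leftover_workers() -> None:
    # On interpreter exit, terminate any children that are still alive,
    # e.g. the MP workers that show up as `pt_main_thread`.
    children = psutil.Process().children(recursive=True)
    for child in children:
        try:
            child.terminate()  # SIGTERM first
        except psutil.NoSuchProcess:
            pass
    _, alive = psutil.wait_procs(children, timeout=5)
    for child in alive:
        child.kill()  # escalate to SIGKILL for anything that ignored SIGTERM


atexit.register(_kill_leftover_workers)
```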
The closest reproducer outside of Triton is this:
```python
from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.utils import random_uuid
import asyncio

SAMPLING_PARAMETERS = {"temperature": 0, "top_p": 1}
VLLM_ENGINE_CONFIG = {
    "model": "facebook/opt-125m",
    "disable_log_requests": "true",
    "gpu_memory_utilization": 0.5,
    "enforce_eager": "true",
    "tensor_parallel_size": 2,
}
PROMPTS = [
    "The most dangerous animal is",
    "The capital of France is",
    "The future of AI is",
]


async def generate_python_vllm_output(prompt, llm_engine):
    request_id = random_uuid()
    sampling_params = SamplingParams(**SAMPLING_PARAMETERS)
    python_vllm_output = None
    last_output = None
    async for vllm_output in llm_engine.generate(prompt, sampling_params, request_id):
        last_output = vllm_output
    if last_output:
        python_vllm_output = [
            (prompt + output.text).encode("utf-8") for output in last_output.outputs
        ]
    return python_vllm_output


llm_engine = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(**VLLM_ENGINE_CONFIG))
python_vllm_output = []
for i in range(len(PROMPTS) * 1000):
    # Cycle through the prompts repeatedly.
    python_vllm_output.extend(
        asyncio.run(generate_python_vllm_output(PROMPTS[i % len(PROMPTS)], llm_engine))
    )
```
The workflow then looks like this:
```
# ps
  PID TTY          TIME CMD
    1 pts/0    00:00:00 bash
21346 pts/0    00:00:00 top
21927 pts/0    00:00:00 top
22463 pts/0    00:00:00 ps
# python3 vllm_reproducer.py &
...
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 7.38it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 7.37it/s]
INFO 07-25 00:18:58 model_runner.py:692] Loading model weights took 0.1202 GB
(VllmWorkerProcess pid=22534) INFO 07-25 00:18:58 model_runner.py:692] Loading model weights took 0.1202 GB
INFO 07-25 00:18:58 distributed_gpu_executor.py:56] # GPU blocks: 68037, # CPU blocks: 14563
# pkill -9 python3
# ps
  PID TTY          TIME CMD
    1 pts/0    00:00:00 bash
21346 pts/0    00:00:00 top
21927 pts/0    00:00:00 top
22465 pts/0    00:00:22 pt_main_thread
22534 pts/0    00:00:14 pt_main_thread
22576 pts/0    00:00:00 python3 <defunct>
22745 pts/0    00:00:00 ps
```
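For comparison, if the reproducer is started in its own process group and the whole group is signalled, the workers do go away. This is just a sketch of that experiment (the script name and the sleep duration are assumptions), not a fix for the underlying issue:

```python
import os
import signal
import subprocess
import time

# Start the reproducer in a new session so it and everything it forks
# share one process group that is separate from the shell's.
proc = subprocess.Popen(["python3", "vllm_reproducer.py"], start_new_session=True)

time.sleep(60)  # give the engine time to come up; the duration is arbitrary

# Signal the whole group: the pt_main_thread workers receive it as well,
# unlike `pkill -9 python3`, which only hits the parent process.
os.killpg(os.getpgid(proc.pid), signal.SIGTERM)
proc.wait()
```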
As before, the two `pt_main_thread` processes above are in the sleeping state according to `cat /proc/<PID>/status`.
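For completeness, a minimal helper equivalent to that manual `/proc` check (Linux only; the PID is the one from the `ps` output above):

```python
from pathlib import Path


def proc_state(pid: int) -> str:
    # Read the "State:" field from /proc/<pid>/status (Linux only).
    for line in Path(f"/proc/{pid}/status").read_text().splitlines():
        if line.startswith("State:"):
            return line.split(":", 1)[1].strip()
    return "unknown"


print(proc_state(22465))  # prints "S (sleeping)" for the leftover worker
```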
Any insights into vLLM's distributed serving with multiprocessing are greatly appreciated.