[Misc]: Strange `leaked shared_memory` warnings reported by multiprocessing when using vLLM
Anything you want to discuss about vllm.
Here is a simple example to run vLLM.
When I add `import multiprocessing` and set `tensor_parallel_size > 1` (in my code the value is 2), I get annoying `leaked shared_memory` warnings.
When I remove the `import multiprocessing` *or* set `tensor_parallel_size=1`, everything is OK (please note that I say *or*: either change alone makes the warnings go away).
I am not sure whether this warning could lead to memory problems later.
Thanks for any attention!
```python
import multiprocessing  # one of the two variants that trigger the warnings

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-7B-Instruct")
sampling_params = SamplingParams(
    temperature=0.7, top_p=0.8, repetition_penalty=1.05, max_tokens=512
)
# tensor_parallel_size > 1 is the other variant that triggers the warnings
model = LLM(model="Qwen/Qwen2.5-Coder-7B-Instruct", dtype="half", tensor_parallel_size=2)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "hello world!"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = model.generate([text], sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
```
Warning log:

```text
INFO 09-25 12:12:15 multiproc_worker_utils.py:123] Killing local vLLM worker processes
[rank0]:[W925 12:12:17.096459736 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
/root/anaconda3/envs/vllm_quick/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
```
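For context, the last two lines come from Python's own `multiprocessing.resource_tracker`, not from vLLM itself. The same message can be reproduced without vLLM by creating a shared-memory block and never unlinking it; a minimal sketch:

```python
# Minimal sketch: reproduces the identical resource_tracker warning without vLLM.
from multiprocessing import shared_memory

shm = shared_memory.SharedMemory(create=True, size=1024)
shm.close()
# shm.unlink() is intentionally skipped, so at interpreter shutdown the resource
# tracker prints:
#   UserWarning: resource_tracker: There appear to be 1 leaked shared_memory
#   objects to clean up at shutdown
```

In other words, the warning means a shared-memory segment was still registered at exit and the tracker cleaned it up on your behalf; it does not by itself indicate a leak while the process is running.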
Before submitting a new issue...
- [X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
PS: I am currently using a Qwen model; I have not tried any other model.
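One pattern that is often recommended when `import multiprocessing` is combined with `tensor_parallel_size > 1` is to guard the script's entry point; below is a hedged sketch of the repro with that guard (whether it actually removes the warning is not confirmed here):

```python
# Hedged sketch (not a confirmed fix): the same repro wrapped in the standard
# multiprocessing entry-point guard recommended whenever workers may be spawned.
import multiprocessing

from vllm import LLM, SamplingParams


def main():
    sampling_params = SamplingParams(
        temperature=0.7, top_p=0.8, repetition_penalty=1.05, max_tokens=512
    )
    model = LLM(model="Qwen/Qwen2.5-Coder-7B-Instruct", dtype="half",
                tensor_parallel_size=2)
    outputs = model.generate(["hello world!"], sampling_params=sampling_params)
    print(outputs[0].outputs[0].text)


if __name__ == "__main__":
    main()
```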
I am encountering the same issue.
If necessary, I think I need to open a new issue labeled as a bug to catch the developers' attention, because I have actually run into some problems caused by this issue.
Thanks @shaoyuyoung, this is a known issue but as far as I know is benign. Could you elaborate on the problems that you've encountered from this?
Help with digging into the root cause and figuring out how to avoid these messages would be welcome!
> Thanks @shaoyuyoung, this is a known issue but as far as I know is benign. Could you elaborate on the problems that you've encountered from this?
> Help with digging into the root cause and figuring out how to avoid these messages would be welcome!
I use this command to start a server and then kill the process; this issue will appear:

```bash
nohup vllm serve Qwen/Qwen2.5-14B-Instruct --served-model-name Qwen/Qwen2.5-14B-Instruct --enable-auto-tool-choice --tool-call-parser hermes --tensor-parallel-size 4 >> server.log 2>&1 &
```
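If the warnings only show up when the parent process is hard-killed, one hedged workaround (an assumption about the cause, not a confirmed fix) is to stop the server with SIGINT/SIGTERM and let it shut down on its own, so the workers get a chance to release their shared memory; a sketch using the standard library:

```python
# Hedged sketch: launch `vllm serve` as a child process and stop it with SIGINT
# instead of SIGKILL, assuming a hard kill is what leaves shared memory behind.
import signal
import subprocess
import time

proc = subprocess.Popen(
    ["vllm", "serve", "Qwen/Qwen2.5-14B-Instruct",
     "--served-model-name", "Qwen/Qwen2.5-14B-Instruct",
     "--tensor-parallel-size", "4"]
)

time.sleep(600)  # placeholder for "serve traffic for a while"

proc.send_signal(signal.SIGINT)  # graceful shutdown, like pressing Ctrl+C
proc.wait(timeout=120)           # give the engine time to clean up
```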
same here
I am encountering the same issue when querying the vLLM server, and then my server terminated unexpectedly.
> I am encountering the same issue when querying the vLLM server, and then my server terminated unexpectedly.
@Skytliang can you give some code or steps to reproduce it? Maybe the developers need this :)
> @Skytliang can you give some code or steps to reproduce it? Maybe the developers need this :)
Code:

```bash
export VLLM_LOGGING_LEVEL=DEBUG
export CUDA_LAUNCH_BLOCKING=1
export NCCL_DEBUG=TRACE
export VLLM_TRACE_FUNCTION=1

vllm serve $model_path \
    --tensor-parallel-size 8 \
    --enforce-eager
```
Log:

```text
INFO 10-21 14:42:25 metrics.py:345] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
DEBUG 10-21 14:42:25 engine.py:215] Waiting for new requests in engine loop.
DEBUG 10-21 14:42:27 client.py:154] Heartbeat successful.
DEBUG 10-21 14:42:29 client.py:154] Heartbeat successful.
DEBUG 10-21 14:42:31 client.py:154] Heartbeat successful.
DEBUG 10-21 14:42:33 client.py:154] Heartbeat successful.
DEBUG 10-21 14:42:35 client.py:170] Waiting for output from MQLLMEngine.
DEBUG 10-21 14:42:35 client.py:154] Heartbeat successful.
INFO 10-21 14:42:35 metrics.py:345] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
DEBUG 10-21 14:42:35 engine.py:215] Waiting for new requests in engine loop.
DEBUG 10-21 14:42:37 client.py:154] Heartbeat successful.
DEBUG 10-21 14:42:39 client.py:154] Heartbeat successful.
DEBUG 10-21 14:42:41 client.py:154] Heartbeat successful.
DEBUG 10-21 14:42:43 client.py:154] Heartbeat successful.
INFO 10-21 14:42:44 logger.py:37] Received request cmpl-1226127cfd2e4032a7ae338bc48a4404-0: prompt: 'San Francisco is a', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=7, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), guided_decoding=GuidedDecodingParams(json=None, regex=None, choice=None, grammar=None, json_object=None, backend=None, whitespace_pattern=None), prompt_token_ids: [128000, 24661, 13175, 374, 264], lora_request: None, prompt_adapter_request: None.
DEBUG 10-21 14:42:44 async_llm_engine.py:525] Building guided decoding logits processor. Params: GuidedDecodingParams(json=None, regex=None, choice=None, grammar=None, json_object=None, backend=None, whitespace_pattern=None)
INFO 10-21 14:42:44 engine.py:292] Added request cmpl-1226127cfd2e4032a7ae338bc48a4404-0.
DEBUG 10-21 14:42:45 client.py:170] Waiting for output from MQLLMEngine.
ERROR 10-21 14:42:53 client.py:250] TimeoutError('No heartbeat received from MQLLMEngine')
ERROR 10-21 14:42:53 client.py:250] NoneType: None
DEBUG 10-21 14:42:53 client.py:144] Shutting down MQLLMEngineClient check health loop due to timeout
DEBUG 10-21 14:42:55 client.py:170] Waiting for output from MQLLMEngine.
CRITICAL 10-21 14:42:55 launcher.py:99] MQLLMEngine is already dead, terminating server process
INFO: 9.218.231.135:24050 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
INFO: Shutting down
INFO: Waiting for application shutdown.
INFO: Application shutdown complete.
INFO: Finished server process [24735]
/opt/conda/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 32 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
/opt/conda/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
```
Any progress?
same here
same here with qwen-2.5-7B
Running into this with Qwen2.5-3B-Instruct. Update: the issue in my case was that I was running vLLM from the CLI with far too low a `ulimit` for open files (it was 1024; raising the file-descriptor limit to 65535 fixed it).
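For anyone hitting the same limit, a minimal sketch of checking and raising the open-file limit from Python before starting vLLM (the shell equivalent is `ulimit -n 65535`; 65535 is just the value used above, capped at the hard limit):

```python
# Minimal sketch: inspect and raise the file-descriptor limit for this process.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"RLIMIT_NOFILE: soft={soft}, hard={hard}")

target = 65535  # value mentioned above; must not exceed the hard limit
if hard != resource.RLIM_INFINITY:
    target = min(target, hard)
if soft < target:
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
```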
Same issue with Qwen GPTQ; I also tried AWQ, but it is the same there.
Same here with Llama 3 7B Instruct.
I get the same
> Thanks @shaoyuyoung, this is a known issue but as far as I know is benign. Could you elaborate on the problems that you've encountered from this? Help with digging into the root cause and figuring out how to avoid these messages would be welcome!
>
> I use this command to start a server and then kill the process; this issue will appear: `nohup vllm serve Qwen/Qwen2.5-14B-Instruct --served-model-name Qwen/Qwen2.5-14B-Instruct --enable-auto-tool-choice --tool-call-parser hermes --tensor-parallel-size 4 >> server.log 2>&1 &`
Same as you, and I can't start a new vllm server now
Same problem. Any method to fix this?
I don't think this is a bug. It is an OOM error.
> I don't think this is a bug. It is an OOM error.
Sorry, I don't think so, because the process executes successfully on my device and only throws this warning at the end, which is annoying :(
Same here
Found the solution here: https://github.com/vllm-project/vllm/issues/8933 and here https://stackoverflow.com/questions/52421068/error-in-slurm-cluster-detected-1-oom-kill-events-how-to-improve-running-jo/62133895#62133895
I fixed it by decreasing the --max-model-len parameter. It indeed seems to be an OOM issue.
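For the offline Python API, the corresponding knobs are the `max_model_len` and `gpu_memory_utilization` arguments of `LLM`; a hedged sketch (the numbers are placeholders, not recommendations):

```python
# Hedged sketch: the Python-API counterpart of lowering --max-model-len when the
# engine dies from running out of GPU memory. Values below are placeholders.
from vllm import LLM

model = LLM(
    model="Qwen/Qwen2.5-Coder-7B-Instruct",
    tensor_parallel_size=2,
    max_model_len=4096,           # cap the context length the KV cache must cover
    gpu_memory_utilization=0.90,  # fraction of each GPU's memory vLLM may reserve
)
```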
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!