[Misc]: Strange `leaked shared_memory` warnings reported by multiprocessing when using vLLM
Anything you want to discuss about vllm.
Here is a simple example to run vLLM.
When I add `import multiprocessing` and set `tensor_parallel_size > 1` (in my code the value is 2), I get annoying `leaked shared_memory` warnings.
When I remove the `import multiprocessing` *or* set `tensor_parallel_size=1`, everything is OK (please note that I say *or*: either change alone makes the warnings go away).
I am not sure whether this warning could lead to memory problems later.
Thanks for any attention!
```python
import multiprocessing  # one of the two variants that trigger the warnings

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-7B-Instruct")
sampling_params = SamplingParams(
    temperature=0.7, top_p=0.8, repetition_penalty=1.05, max_tokens=512
)
# tensor_parallel_size > 1 is the other variant that triggers the warnings
model = LLM(model="Qwen/Qwen2.5-Coder-7B-Instruct", dtype="half", tensor_parallel_size=2)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "hello world!"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = model.generate([text], sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
```
Warning log:

```text
INFO 09-25 12:12:15 multiproc_worker_utils.py:123] Killing local vLLM worker processes
[rank0]:[W925 12:12:17.096459736 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
/root/anaconda3/envs/vllm_quick/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
```
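For context, the last two lines come from Python's own `multiprocessing.resource_tracker`, not from vLLM itself. The same message can be reproduced without vLLM by creating a shared-memory block and never unlinking it; a minimal sketch:

```python
# Minimal sketch: reproduces the identical resource_tracker warning without vLLM.
from multiprocessing import shared_memory

shm = shared_memory.SharedMemory(create=True, size=1024)
shm.close()
# shm.unlink() is intentionally skipped, so at interpreter shutdown the resource
# tracker prints:
#   UserWarning: resource_tracker: There appear to be 1 leaked shared_memory
#   objects to clean up at shutdown
```

In other words, the warning means a shared-memory segment was still registered at exit and the tracker cleaned it up on your behalf; it does not by itself indicate a leak while the process is running.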
Before submitting a new issue...
- [X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
PS: I am currently using a Qwen model; I have not tried any other model.
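One pattern that is often recommended when `import multiprocessing` is combined with `tensor_parallel_size > 1` is to guard the script's entry point; below is a hedged sketch of the repro with that guard (whether it actually removes the warning is not confirmed here):

```python
# Hedged sketch (not a confirmed fix): the same repro wrapped in the standard
# multiprocessing entry-point guard recommended whenever workers may be spawned.
import multiprocessing

from vllm import LLM, SamplingParams


def main():
    sampling_params = SamplingParams(
        temperature=0.7, top_p=0.8, repetition_penalty=1.05, max_tokens=512
    )
    model = LLM(model="Qwen/Qwen2.5-Coder-7B-Instruct", dtype="half",
                tensor_parallel_size=2)
    outputs = model.generate(["hello world!"], sampling_params=sampling_params)
    print(outputs[0].outputs[0].text)


if __name__ == "__main__":
    main()
```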
I am encountering the same issue.
If necessary, I think I need to open a new issue labeled as a bug to catch the developers' attention, because I have actually run into some problems caused by this issue.
Thanks @shaoyuyoung, this is a known issue but as far as I know is benign. Could you elaborate on the problems that you've encountered from this?
Help with digging into the root cause and figuring out how to avoid these messages would be welcome!
> Thanks @shaoyuyoung, this is a known issue but as far as I know is benign. Could you elaborate on the problems that you've encountered from this?
> Help with digging into the root cause and figuring out how to avoid these messages would be welcome!
I use this command to start a server and then kill the process; this issue will appear:

```bash
nohup vllm serve Qwen/Qwen2.5-14B-Instruct --served-model-name Qwen/Qwen2.5-14B-Instruct --enable-auto-tool-choice --tool-call-parser hermes --tensor-parallel-size 4 >> server.log 2>&1 &
```
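If the warnings only show up when the parent process is hard-killed, one hedged workaround (an assumption about the cause, not a confirmed fix) is to stop the server with SIGINT/SIGTERM and let it shut down on its own, so the workers get a chance to release their shared memory; a sketch using the standard library:

```python
# Hedged sketch: launch `vllm serve` as a child process and stop it with SIGINT
# instead of SIGKILL, assuming a hard kill is what leaves shared memory behind.
import signal
import subprocess
import time

proc = subprocess.Popen(
    ["vllm", "serve", "Qwen/Qwen2.5-14B-Instruct",
     "--served-model-name", "Qwen/Qwen2.5-14B-Instruct",
     "--tensor-parallel-size", "4"]
)

time.sleep(600)  # placeholder for "serve traffic for a while"

proc.send_signal(signal.SIGINT)  # graceful shutdown, like pressing Ctrl+C
proc.wait(timeout=120)           # give the engine time to clean up
```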
same here
I am encountering the same issue when querying the vLLM server, and then my server terminated unexpectedly.
> I am encountering the same issue when querying the vLLM server, and then my server terminated unexpectedly.
@Skytliang can you give some code or steps to reproduce it? Maybe the developers need this :)
> @Skytliang can you give some code or steps to reproduce it? Maybe the developers need this :)
Code:

```bash
export VLLM_LOGGING_LEVEL=DEBUG
export CUDA_LAUNCH_BLOCKING=1
export NCCL_DEBUG=TRACE
export VLLM_TRACE_FUNCTION=1

vllm serve $model_path \
    --tensor-parallel-size 8 \
    --enforce-eager
```
Log:

```text
INFO 10-21 14:42:25 metrics.py:345] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
DEBUG 10-21 14:42:25 engine.py:215] Waiting for new requests in engine loop.
DEBUG 10-21 14:42:27 client.py:154] Heartbeat successful.
DEBUG 10-21 14:42:29 client.py:154] Heartbeat successful.
DEBUG 10-21 14:42:31 client.py:154] Heartbeat successful.
DEBUG 10-21 14:42:33 client.py:154] Heartbeat successful.
DEBUG 10-21 14:42:35 client.py:170] Waiting for output from MQLLMEngine.
DEBUG 10-21 14:42:35 client.py:154] Heartbeat successful.
INFO 10-21 14:42:35 metrics.py:345] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
DEBUG 10-21 14:42:35 engine.py:215] Waiting for new requests in engine loop.
DEBUG 10-21 14:42:37 client.py:154] Heartbeat successful.
DEBUG 10-21 14:42:39 client.py:154] Heartbeat successful.
DEBUG 10-21 14:42:41 client.py:154] Heartbeat successful.
DEBUG 10-21 14:42:43 client.py:154] Heartbeat successful.
INFO 10-21 14:42:44 logger.py:37] Received request cmpl-1226127cfd2e4032a7ae338bc48a4404-0: prompt: 'San Francisco is a', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=7, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), guided_decoding=GuidedDecodingParams(json=None, regex=None, choice=None, grammar=None, json_object=None, backend=None, whitespace_pattern=None), prompt_token_ids: [128000, 24661, 13175, 374, 264], lora_request: None, prompt_adapter_request: None.
DEBUG 10-21 14:42:44 async_llm_engine.py:525] Building guided decoding logits processor. Params: GuidedDecodingParams(json=None, regex=None, choice=None, grammar=None, json_object=None, backend=None, whitespace_pattern=None)
INFO 10-21 14:42:44 engine.py:292] Added request cmpl-1226127cfd2e4032a7ae338bc48a4404-0.
DEBUG 10-21 14:42:45 client.py:170] Waiting for output from MQLLMEngine.
ERROR 10-21 14:42:53 client.py:250] TimeoutError('No heartbeat received from MQLLMEngine')
ERROR 10-21 14:42:53 client.py:250] NoneType: None
DEBUG 10-21 14:42:53 client.py:144] Shutting down MQLLMEngineClient check health loop due to timeout
DEBUG 10-21 14:42:55 client.py:170] Waiting for output from MQLLMEngine.
CRITICAL 10-21 14:42:55 launcher.py:99] MQLLMEngine is already dead, terminating server process
INFO: 9.218.231.135:24050 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
INFO: Shutting down
INFO: Waiting for application shutdown.
INFO: Application shutdown complete.
INFO: Finished server process [24735]
/opt/conda/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 32 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
/opt/conda/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
```
Any progress?
same here
same here with qwen-2.5-7B
Running into this with Qwen2.5-3B-Instruct. Update: the issue in my case was that I was running vLLM from the CLI with far too low a `ulimit` for open files (it was 1024; raising the file-descriptor limit to 65535 fixed it).
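For anyone hitting the same limit, a minimal sketch of checking and raising the open-file limit from Python before starting vLLM (the shell equivalent is `ulimit -n 65535`; 65535 is just the value used above, capped at the hard limit):

```python
# Minimal sketch: inspect and raise the file-descriptor limit for this process.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"RLIMIT_NOFILE: soft={soft}, hard={hard}")

target = 65535  # value mentioned above; must not exceed the hard limit
if hard != resource.RLIM_INFINITY:
    target = min(target, hard)
if soft < target:
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
```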
Same issue with Qwen GPTQ; I also tried AWQ, but it is the same there.
Same here with Llama 3 7B Instruct.
I get the same
> Thanks @shaoyuyoung, this is a known issue but as far as I know is benign. Could you elaborate on the problems that you've encountered from this? Help with digging into the root cause and figuring out how to avoid these messages would be welcome!
>
> I use this command to start a server and then kill the process; this issue will appear: `nohup vllm serve Qwen/Qwen2.5-14B-Instruct --served-model-name Qwen/Qwen2.5-14B-Instruct --enable-auto-tool-choice --tool-call-parser hermes --tensor-parallel-size 4 >> server.log 2>&1 &`
Same as you, and I can't start a new vllm server now
Same problem. Any method to fix this?
I don't think this is a bug. It is an OOM error.
> I don't think this is a bug. It is an OOM error.
Sorry, I don't think so, because the process executes successfully on my device and only throws this warning at the end, which is annoying :(
Same here
Found the solution here: https://github.com/vllm-project/vllm/issues/8933 and here https://stackoverflow.com/questions/52421068/error-in-slurm-cluster-detected-1-oom-kill-events-how-to-improve-running-jo/62133895#62133895
I fixed it by decreasing the --max-model-len parameter. It indeed seems to be an OOM issue.
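For the offline Python API, the corresponding knobs are the `max_model_len` and `gpu_memory_utilization` arguments of `LLM`; a hedged sketch (the numbers are placeholders, not recommendations):

```python
# Hedged sketch: the Python-API counterpart of lowering --max-model-len when the
# engine dies from running out of GPU memory. Values below are placeholders.
from vllm import LLM

model = LLM(
    model="Qwen/Qwen2.5-Coder-7B-Instruct",
    tensor_parallel_size=2,
    max_model_len=4096,           # cap the context length the KV cache must cover
    gpu_memory_utilization=0.90,  # fraction of each GPU's memory vLLM may reserve
)
```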
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!