[Bug]: Producer process has been terminated before all shared CUDA tensors released (v0.5.0.post1, v0.4.3)
Your current environment
Docker Image: vllm/vllm-openai:v0.4.3 as well as v0.5.0.post1
Params:
--model=microsoft/Phi-3-medium-4k-instruct
--tensor-parallel-size=2
--disable-log-requests
--trust-remote-code
--max-model-len=2048
--gpu-memory-utilization=0.9
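For reference, these flags map roughly onto the offline Python API, which can make the hang easier to reproduce and debug outside the container. This is only a sketch of an approximately equivalent setup (the report itself uses the OpenAI-server Docker entrypoint, and `--disable-log-requests` has no offline counterpart):

```python
# Sketch: offline-engine configuration roughly equivalent to the server flags above
# (assumes two visible GPUs for tensor_parallel_size=2).
from vllm import LLM, SamplingParams

llm = LLM(
    model="microsoft/Phi-3-medium-4k-instruct",
    tensor_parallel_size=2,
    trust_remote_code=True,
    max_model_len=2048,
    gpu_memory_utilization=0.9,
)
out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```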
The container freezes (does nothing) after printing the following exception in the log.
🐛 Describe the bug
Original exception was:
[rank0]:[W CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
Can you follow https://docs.vllm.ai/en/latest/getting_started/debugging.html to figure out what is happening here?
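For what it's worth, the gist of that page is to turn on more verbose diagnostics before the engine starts. A minimal sketch, assuming a plain Python launch (with the Docker image, the same variables can be passed via `docker run -e ...`):

```python
# Sketch: enable the extra diagnostics suggested by the vLLM debugging guide.
# These must be set before vLLM (and NCCL) are imported/initialized.
import os

os.environ["VLLM_LOGGING_LEVEL"] = "DEBUG"  # verbose vLLM logging
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"    # surface the failing CUDA call synchronously
os.environ["NCCL_DEBUG"] = "TRACE"          # verbose NCCL output for tensor-parallel runs
os.environ["VLLM_TRACE_FUNCTION"] = "1"     # per-function trace (very noisy, last resort)

from vllm import LLM  # import only after the variables are set
```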
@youkaichao I am seeing the same log message as well. What is the general recommendation to remedy it if it's a critical issue? For my case, the program runs fine with that message.
> For my case, the program runs fine with that message.

Then it's just a warning you can ignore.
Same problem. Running:
`model_name = "allenai/Molmo-7B-D-0924"`
`llm = LLM(model=model_name, trust_remote_code=True, dtype="bfloat16", tensor_parallel_size=2)`
Getting:
INFO 11-05 15:35:36 config.py:1704] Downcasting torch.float32 to torch.bfloat16.
INFO 11-05 15:35:41 config.py:944] Defaulting to use mp for distributed inference
INFO 11-05 15:35:41 llm_engine.py:242] Initializing an LLM engine (v0.6.3.post2.dev127+g2adb4409) with config: model='allenai/Molmo-7B-D-0924', speculative_config=None, tokenizer='allenai/Molmo-7B-D-0924', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=allenai/Molmo-7B-D-0924, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=False, chat_template_text_format=string, mm_processor_kwargs=None)
WARNING 11-05 15:35:42 multiproc_gpu_executor.py:53] Reducing Torch parallelism from 12 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 11-05 15:35:42 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
(VllmWorkerProcess pid=9216) INFO 11-05 15:35:42 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
INFO 11-05 15:35:43 utils.py:976] Found nccl from library libnccl.so.2
INFO 11-05 15:35:43 pynccl.py:63] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=9216) INFO 11-05 15:35:43 utils.py:976] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=9216) INFO 11-05 15:35:43 pynccl.py:63] vLLM is using nccl==2.21.5
INFO 11-05 15:35:43 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /home/jupyter/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorkerProcess pid=9216) INFO 11-05 15:35:43 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /home/jupyter/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
Failed: Cuda error /workspace/csrc/custom_all_reduce.cuh:336 'invalid argument'
Failed: Cuda error /workspace/csrc/custom_all_reduce.cuh:336 'invalid argument'
[rank0]:[W1105 15:35:43.442228048 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
[rank1]:[W1105 15:35:43.442235220 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
This is on an instance with 2 NVIDIA L4 GPUs, and it kills my kernel.
I upgraded vLLM after reading https://github.com/vllm-project/vllm/issues/9774 and that fixed this issue, although it still crashes for another reason.
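For anyone who hits the `Failed: Cuda error /workspace/csrc/custom_all_reduce.cuh:336 'invalid argument'` lines above and cannot upgrade right away: that error comes from vLLM's custom all-reduce kernel (note `disable_custom_all_reduce=False` in the config line), so disabling that kernel and falling back to NCCL is a workaround worth trying. A sketch only; whether it actually helps on a given 2x L4 setup is an assumption:

```python
# Sketch: disable vLLM's custom all-reduce kernel and fall back to NCCL all-reduce.
from vllm import LLM

llm = LLM(
    model="allenai/Molmo-7B-D-0924",
    trust_remote_code=True,
    dtype="bfloat16",
    tensor_parallel_size=2,
    disable_custom_all_reduce=True,  # skip the kernel that raised 'invalid argument'
)
```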
> For my case, the program runs fine with that message.
>
> Then it's just a warning you can ignore.

What's the cause of this warning? When I convert a bge-m3 model to ONNX with optimum, I also get this warning.
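As far as I understand, the message comes from PyTorch's CUDA IPC machinery rather than from vLLM or optimum specifically: when one process shares CUDA tensors with another (the pattern described in PyTorch's Note [Sharing CUDA tensors]) and then shuts down while the consumer may still hold references, the producer logs this warning during cleanup. A rough illustration of that producer/consumer pattern, as a sketch only (it is not a guaranteed reproduction of the warning):

```python
# Sketch: the CUDA-tensor sharing pattern that Note [Sharing CUDA tensors] describes.
# The warning appears when the producer process exits while the consumer may still
# hold references to the shared tensor.
import torch
import torch.multiprocessing as mp

def consumer(queue, done):
    t = queue.get()           # receives an IPC handle to the producer's CUDA memory
    print(t.sum().item())
    done.set()                # tell the producer it is safe to exit

if __name__ == "__main__":
    mp.set_start_method("spawn")          # required for sharing CUDA tensors over IPC
    q, done = mp.Queue(), mp.Event()
    c = mp.Process(target=consumer, args=(q, done))
    c.start()
    q.put(torch.ones(4, device="cuda"))   # this process acts as the producer
    done.wait()                           # exiting before the consumer is finished is
    c.join()                              # what triggers the CudaIPCTypes warning
```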
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!