
[Bug]: Producer process has been terminated before all shared CUDA tensors released (v0.5.0.post1, v0.4.3)

Open yaronr opened this issue 1 year ago • 3 comments

Your current environment

Docker image: vllm/vllm-openai:v0.4.3 as well as v0.5.0.post1

Params:

--model=microsoft/Phi-3-medium-4k-instruct 
--tensor-parallel-size=2
--disable-log-requests
--trust-remote-code
--max-model-len=2048
--gpu-memory-utilization=0.9
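
For reference, a roughly equivalent configuration through vLLM's offline `LLM` API is sketched below (parameter names are assumed to mirror the server flags above; `--disable-log-requests` is server-only and omitted):

```python
# Hedged sketch: offline-API equivalent of the server flags above.
from vllm import LLM, SamplingParams

llm = LLM(
    model="microsoft/Phi-3-medium-4k-instruct",
    tensor_parallel_size=2,
    trust_remote_code=True,
    max_model_len=2048,
    gpu_memory_utilization=0.9,
)

# Quick smoke test: generate a short completion.
outputs = llm.generate(["Hello, world!"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```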

The container freezes (does nothing) after printing the following exception to the log.

🐛 Describe the bug

Original exception was:
[rank0]:[W CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]

yaronr · Jul 01 '24 10:07

Can you follow https://docs.vllm.ai/en/latest/getting_started/debugging.html to figure out what is happening here?

youkaichao · Jul 06 '24 01:07
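
The debugging guide linked above walks through a multi-GPU communication sanity check; a minimal sketch of that kind of test (not the guide's exact script) looks roughly like this, assuming two GPUs and a `torchrun` launch:

```python
# Minimal NCCL sanity check in the spirit of the vLLM debugging guide.
# Run with: torchrun --nproc-per-node=2 check_comm.py
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

# Each rank contributes a tensor of ones; after all_reduce the mean
# should equal the world size if NCCL communication works.
data = torch.ones(128, device="cuda")
dist.all_reduce(data, op=dist.ReduceOp.SUM)
torch.cuda.synchronize()
assert data.mean().item() == dist.get_world_size()
print(f"rank {dist.get_rank()}: NCCL all_reduce OK")
dist.destroy_process_group()
```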

@youkaichao I am seeing the same log as well. What is the general recommendation to remedy it if it's a critical issue? For my case, the program runs fine with that message.

sayakpaul · Sep 11 '24 12:09

> For my case, the program runs fine with that message.

then it's just a warning you can ignore.

youkaichao · Sep 11 '24 15:09

Same problem. Running:

```python
from vllm import LLM

model_name = "allenai/Molmo-7B-D-0924"

llm = LLM(
    model=model_name,
    trust_remote_code=True,
    dtype="bfloat16",
    tensor_parallel_size=2,
)
```

Getting:

```
INFO 11-05 15:35:36 config.py:1704] Downcasting torch.float32 to torch.bfloat16.
INFO 11-05 15:35:41 config.py:944] Defaulting to use mp for distributed inference
INFO 11-05 15:35:41 llm_engine.py:242] Initializing an LLM engine (v0.6.3.post2.dev127+g2adb4409) with config: model='allenai/Molmo-7B-D-0924', speculative_config=None, tokenizer='allenai/Molmo-7B-D-0924', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=allenai/Molmo-7B-D-0924, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=False, chat_template_text_format=string, mm_processor_kwargs=None)
WARNING 11-05 15:35:42 multiproc_gpu_executor.py:53] Reducing Torch parallelism from 12 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 11-05 15:35:42 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
(VllmWorkerProcess pid=9216) INFO 11-05 15:35:42 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
INFO 11-05 15:35:43 utils.py:976] Found nccl from library libnccl.so.2
INFO 11-05 15:35:43 pynccl.py:63] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=9216) INFO 11-05 15:35:43 utils.py:976] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=9216) INFO 11-05 15:35:43 pynccl.py:63] vLLM is using nccl==2.21.5
INFO 11-05 15:35:43 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /home/jupyter/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorkerProcess pid=9216) INFO 11-05 15:35:43 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /home/jupyter/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
Failed: Cuda error /workspace/csrc/custom_all_reduce.cuh:336 'invalid argument'
Failed: Cuda error /workspace/csrc/custom_all_reduce.cuh:336 'invalid argument'
[rank0]:[W1105 15:35:43.442228048 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
[rank1]:[W1105 15:35:43.442235220 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
```

On an instance with 2 Nvidia L4 GPUs. It kills my kernel.

OsaCode · Nov 05 '24 15:11

I upgraded vLLM after reading https://github.com/vllm-project/vllm/issues/9774 and it fixed this issue, although it still crashes for another reason.

OsaCode · Nov 05 '24 15:11

> For my case, the program runs fine with that message.
>
> then it's just a warning you can ignore.

What's the cause of this warning? When I convert a bge-m3 model to ONNX with Optimum, I also get this warning.

chansonzhang · Jan 14 '25 08:01
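
For context, a minimal sketch of the kind of export mentioned above, assuming Optimum's ONNX Runtime integration and the BAAI/bge-m3 checkpoint (not necessarily the commenter's exact workflow):

```python
# Hedged sketch: export bge-m3 to ONNX via Optimum's ONNX Runtime classes.
from optimum.onnxruntime import ORTModelForFeatureExtraction

# export=True converts the PyTorch checkpoint to ONNX on the fly.
model = ORTModelForFeatureExtraction.from_pretrained("BAAI/bge-m3", export=True)
model.save_pretrained("bge-m3-onnx")
```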

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions[bot] · Apr 15 '25 02:04

This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!

github-actions[bot] · May 15 '25 02:05