
[P/D][V1] KV Connector API V1

Open ApostaC opened this issue 8 months ago • 17 comments

TL;DR:

This PR introduces the KV connector API in v1 to support disaggregated prefill. It also includes a minimal functional implementation as an example of how to use the connector API.

Detailed design doc: https://docs.google.com/document/d/1uPGdbEXksKXeN4Q9nUm9hzotqEjQhYmnpAhidLuAsjk

This PR is co-authored by:

TODOs in the upcoming PRs

  • [ ] More performant connector implementation using P2P connections
  • [ ] MLA support
  • [ ] Enable the KVCacheManager to allocate temporary blocks for the connector to use

Key design choices

  • Implement disaggregated prefill under the hood of v1's prefix-caching and chunked-prefill semantics: the vLLM scheduler calculates which set of tokens needs a KV store or KV load, and the workers perform the actual KV store or load operations.
  • Provide layer-wise async API support
  • KV cache prefetching and request orchestration should happen outside vLLM so that the changes in the core can be minimized

High-level design of the KV connector in v1

The figure below shows the high-level design of the connector:

[image: high-level design of the KV connector]

In the design, every process in vLLM will have a corresponding connector. Specifically, we have

  • Scheduler connector: the connector that lives in the same process as the scheduler. It schedules the KV cache transfer ops.
  • Worker connectors: the connectors that live in the worker processes. They execute the KV cache transfer ops.
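
As a rough sketch of this split, a connector might expose one group of methods used by the scheduler process and another used by the worker processes. Only KVConnectorBase_V1 and wait_for_layer_load appear verbatim elsewhere in this thread; every other name below is an illustrative assumption, not necessarily the exact API in this PR.

class ExampleConnector:  # in the PR this would subclass KVConnectorBase_V1
    # ----- scheduler-side methods (run inside the scheduler process) -----
    def get_num_external_tokens(self, request):
        # Report how many of the request's tokens can be served from the
        # external KV store (illustrative name).
        return 0

    def build_connector_metadata(self, scheduler_output):
        # Turn the scheduler's decisions into per-step metadata describing
        # which blocks to store/load; shipped to the workers (illustrative name).
        return {}

    # ----- worker-side methods (run inside the worker processes) -----
    def start_load_kv(self, connector_metadata):
        # Kick off (possibly asynchronous) KV loads before the forward pass.
        pass

    def wait_for_layer_load(self, layer_name):
        # Block until this layer's KV cache has landed in paged GPU memory.
        pass

    def save_kv_layer(self, layer_name, kv_layer):
        # Hand this layer's freshly computed KV to the connector for storing
        # or sending (illustrative name).
        pass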

Scheduler connector

On prefill nodes, the scheduler connector needs to parse the scheduler's output and determine what tokens should have their KV cache transmitted to the decoder nodes.

On decoder nodes, the scheduler connector needs to return the "correct" num_computed_tokens and computed_blocks when calling get_computed_tokens.
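
As a toy illustration of that decode-side accounting, the sketch below tops up the locally cached tokens with whatever the connector reports as externally available. Apart from num_computed_tokens, computed_blocks, and get_computed_tokens mentioned above, every identifier is an assumption, not the PR's actual code.

BLOCK_SIZE = 16  # assumed block size for the example

class ToySchedulerConnector:
    def __init__(self, externally_cached_tokens):
        # request_id -> number of tokens whose KV sits in the external store
        self.externally_cached_tokens = externally_cached_tokens

    def get_num_external_tokens(self, request_id, local_hit_tokens):
        # Tokens beyond the local prefix-cache hit that can be loaded externally.
        total = self.externally_cached_tokens.get(request_id, 0)
        return max(0, total - local_hit_tokens)

def get_computed_tokens(request_id, local_hit_tokens, connector):
    computed_blocks = local_hit_tokens // BLOCK_SIZE
    external = connector.get_num_external_tokens(request_id, local_hit_tokens)
    # Blocks for the externally cached tokens still need to be allocated so the
    # worker connector can inject the loaded KV into paged memory.
    num_computed_tokens = local_hit_tokens + external
    computed_blocks += external // BLOCK_SIZE
    return num_computed_tokens, computed_blocks

conn = ToySchedulerConnector({"req-0": 992})  # 992 matches an "Inject KV cache of 992 tokens" log below
print(get_computed_tokens("req-0", 0, conn))  # -> (992, 62)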

Worker connector

The figure below shows how the worker connector works with the attention module to achieve layer-by-layer KV cache store and load:

[image: layer-by-layer KV cache store/load between the worker connector and the attention module]
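
The load-side hook is visible in a traceback later in this thread: vllm/attention/layer.py calls get_kv_transfer_group().wait_for_layer_load(self.layer_name) before running attention. Here is a small self-contained toy of that layer-by-layer pattern; only wait_for_layer_load is confirmed by the thread, and the save-side method name is an assumed counterpart.

class FakeWorkerConnector:
    def wait_for_layer_load(self, layer_name):
        # Real connector: block until this layer's KV (loaded asynchronously,
        # starting before the forward pass) has been injected into paged memory.
        print(f"[load ] KV ready for {layer_name}")

    def save_kv_layer(self, layer_name, kv_cache):
        # Real connector: asynchronously push this layer's KV toward the
        # external store / decode node while later layers keep computing.
        print(f"[store] KV queued for {layer_name}")

def attention_layer_forward(connector, layer_name, kv_cache):
    connector.wait_for_layer_load(layer_name)      # decode side: wait for injected KV
    kv_cache[layer_name] = "k/v written here"      # ...attention compute happens here...
    connector.save_kv_layer(layer_name, kv_cache)  # prefill side: stream this layer out

connector, kv_cache = FakeWorkerConnector(), {}
for i in range(2):
    attention_layer_forward(connector, f"model.layers.{i}.self_attn.attn", kv_cache)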

Working with outside orchestrator

In more advanced use cases like xPyD, the connector may need to learn from an outside orchestrator which decoder node to send the KV cache to. We believe different infrastructure providers may have very different orchestration logic, so such logic should reside outside of vLLM.

The figure below explains the workflow among the orchestrator, vLLM, and the connector:

[image: workflow among the orchestrator, vLLM, and the connector]

At a high level, the orchestrator should determine when to send the request to which node. Also, the connector may give the orchestrator some feedback, such as "KV cache transfer finished" (depending on the implementation).
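
As a toy illustration of that division of labor (the PR intentionally leaves the orchestrator outside vLLM, so all names and the feedback string below are assumptions):

class ToyOrchestrator:
    def __init__(self, prefill_nodes, decode_nodes):
        self.prefill_nodes = prefill_nodes
        self.decode_nodes = decode_nodes
        self.ready_for_decode = set()

    def route(self, request_id):
        # The orchestrator determines when to send the request to which node.
        p = self.prefill_nodes[hash(request_id) % len(self.prefill_nodes)]
        d = self.decode_nodes[hash(request_id) % len(self.decode_nodes)]
        return p, d  # caller sends the prompt to p and tells p's connector to push KV to d

    def on_connector_feedback(self, request_id, event):
        # Feedback from the connector, e.g. "kv_transfer_finished": only then is
        # it safe to start decoding this request on the decode node.
        if event == "kv_transfer_finished":
            self.ready_for_decode.add(request_id)

orch = ToyOrchestrator(["prefill-0"], ["decode-0", "decode-1"])
print(orch.route("req-42"))
orch.on_connector_feedback("req-42", "kv_transfer_finished")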

For more details, please refer to our design doc: https://docs.google.com/document/d/1uPGdbEXksKXeN4Q9nUm9hzotqEjQhYmnpAhidLuAsjk

Extra note

  • This PR's goal is to ship the connector API with just a minimal functional implementation. We are working on a better (more performant, more stable) implementation, which will land in a new PR soon.

ApostaC avatar Apr 02 '25 18:04 ApostaC

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

github-actions[bot] avatar Apr 02 '25 18:04 github-actions[bot]

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @ApostaC.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify[bot] avatar Apr 02 '25 18:04 mergify[bot]

cc @KuntaiDu @YaoJiayi

ApostaC avatar Apr 02 '25 19:04 ApostaC

@ApostaC I cherry-picked this PR into our repo and ran the example. The following is my log:

INFO 04-03 07:33:59 [factory.py:61] Creating v1 connector with name: SharedStorageConnector
INFO 04-03 07:33:59 [shared_storage_connector.py:91] kv_connector='SharedStorageConnector' kv_buffer_device='cuda' kv_buffer_size=1000000000.0 kv_role='kv_both' kv_rank=None kv_parallel_size=1 kv_ip='127.0.0.1' kv_port=14579 kv_connector_extra_config={}
INFO 04-03 07:33:59 [shared_storage_connector.py:92] Shared storage path is /tmp
Processed prompts:   0%|                                       | 0/4 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]INFO 04-03 07:34:00 [shared_storage_connector.py:250] Hit the cache! Allocate new blocks!
INFO 04-03 07:34:00 [shared_storage_connector.py:250] Hit the cache! Allocate new blocks!
INFO 04-03 07:34:00 [shared_storage_connector.py:250] Hit the cache! Allocate new blocks!
INFO 04-03 07:34:00 [shared_storage_connector.py:250] Hit the cache! Allocate new blocks!
INFO 04-03 07:34:00 [shared_storage_connector.py:131] Start loading KV cache from the connector
Processed prompts: 100%|██████████████████████████| 4/4 [00:00<00:00, 50.53it/s, est. speed input: 38147.85 toks/s, output: 50.56 toks/s]
Prompt: 'Hi Hi Hi Hi Hi H...................Hi Hi Hello, my name is', Generated text: ' the'
Prompt: 'Hi Hi Hi Hi Hi ................i Hi Hi Hi The capital of France is', Generated text: ' the'
Prompt: 'Hey Hey..................Hey Your name is', Generated text: ' the'
Prompt: 'Hey Hey Hey ............... Hey Hey The capital of China is', Generated text: ' '
Saved 4 prompts to output.txt
[rank0]:[W403 07:34:00.128094287 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
INFO 04-03 07:34:03 [__init__.py:239] Automatically detected platform cuda.
Loaded 4 prompts from output.txt
INFO 04-03 07:34:09 [config.py:591] This model supports multiple tasks: {'embed', 'generate', 'score', 'classify', 'reward'}. Defaulting to 'generate'.
INFO 04-03 07:34:10 [config.py:1712] Chunked prefill is enabled with max_num_batched_tokens=8192.
WARNING 04-03 07:34:10 [cuda.py:95] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 04-03 07:34:10 [core.py:54] Initializing a V1 LLM engine (v0.7.4.dev711+gee96432c) with config: model='/disc/data1/Qwen/Qwen2.5-1.5B-Instruct', speculative_config=None, tokenizer='/disc/data1/Qwen/Qwen2.5-1.5B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/disc/data1/Qwen/Qwen2.5-1.5B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}
WARNING 04-03 07:34:11 [utils.py:2321] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7ffe6832ea80>
INFO 04-03 07:34:11 [parallel_state.py:959] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 04-03 07:34:11 [factory.py:61] Creating v1 connector with name: SharedStorageConnector
INFO 04-03 07:34:11 [shared_storage_connector.py:91] kv_connector='SharedStorageConnector' kv_buffer_device='cuda' kv_buffer_size=1000000000.0 kv_role='kv_both' kv_rank=None kv_parallel_size=1 kv_ip='127.0.0.1' kv_port=14579 kv_connector_extra_config={'shared_storage_path': 'local_storage'}
INFO 04-03 07:34:11 [shared_storage_connector.py:92] Shared storage path is local_storage
INFO 04-03 07:34:11 [cuda.py:220] Using Flash Attention backend on V1 engine.
INFO 04-03 07:34:11 [gpu_model_runner.py:1179] Starting to load model /disc/data1/Qwen/Qwen2.5-1.5B-Instruct...
INFO 04-03 07:34:11 [topk_topp_sampler.py:53] Using FlashInfer for top-p & top-k sampling.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.10it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.10it/s]

INFO 04-03 07:34:11 [loader.py:447] Loading weights took 0.50 seconds
INFO 04-03 07:34:12 [gpu_model_runner.py:1191] Model loading took 2.8871 GB and 0.585285 seconds
INFO 04-03 07:34:12 [kv_cache_utils.py:566] GPU KV cache size: 2,644,896 tokens
INFO 04-03 07:34:12 [kv_cache_utils.py:569] Maximum concurrency for 32,768 tokens per request: 80.72x
INFO 04-03 07:34:12 [core.py:152] init engine (profile, create kv cache, warmup model) took 0.88 seconds
INFO 04-03 07:34:12 [factory.py:61] Creating v1 connector with name: SharedStorageConnector
INFO 04-03 07:34:12 [shared_storage_connector.py:91] kv_connector='SharedStorageConnector' kv_buffer_device='cuda' kv_buffer_size=1000000000.0 kv_role='kv_both' kv_rank=None kv_parallel_size=1 kv_ip='127.0.0.1' kv_port=14579 kv_connector_extra_config={'shared_storage_path': 'local_storage'}
INFO 04-03 07:34:12 [shared_storage_connector.py:92] Shared storage path is local_storage
Processed prompts:   0%|                                       | 0/4 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]INFO 04-03 07:34:13 [shared_storage_connector.py:131] Start loading KV cache from the connector
INFO 04-03 07:34:13 [shared_storage_connector.py:131] Start loading KV cache from the connector
INFO 04-03 07:34:13 [shared_storage_connector.py:131] Start loading KV cache from the connector
INFO 04-03 07:34:13 [shared_storage_connector.py:131] Start loading KV cache from the connector
INFO 04-03 07:34:13 [shared_storage_connector.py:131] Start loading KV cache from the connector
INFO 04-03 07:34:13 [shared_storage_connector.py:131] Start loading KV cache from the connector
INFO 04-03 07:34:13 [shared_storage_connector.py:131] Start loading KV cache from the connector
INFO 04-03 07:34:13 [shared_storage_connector.py:131] Start loading KV cache from the connector
INFO 04-03 07:34:13 [shared_storage_connector.py:131] Start loading KV cache from the connector
INFO 04-03 07:34:13 [shared_storage_connector.py:131] Start loading KV cache from the connector
Processed prompts: 100%|█████████████████████████| 4/4 [00:00<00:00, 16.85it/s, est. speed input: 12724.93 toks/s, output: 168.48 toks/s]
Prompt: 'Hi Hi Hi Hi Hi......................Hi Hi Hi Hi Hi Hi Hi Hi Hello, my name is the', Generated text: ' answer: "The answer is: "The answer'
Prompt: 'Hi Hi Hi Hi Hi Hi Hi Hi Hi Hi............. Hi Hi Hi Hi Hi The capital of France is the', Generated text: ' first step. 1. 202'
Prompt: 'Hey Hey Hey Hey ......... Hey Hey Hey Hey Hey Your name is the', Generated text: ' best way to find the best way to find the'
Prompt: 'Hey Hey Hey ............. Hey Hey Hey The capital of China is', Generated text: ' 100000000'
[rank0]:[W403 07:34:13.321916884 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())

Here are the key log lines:

Prompt: 'Hi Hi Hi Hi Hi H...................Hi Hi Hello, my name is', Generated text: ' the'
Prompt: 'Hi Hi Hi Hi Hi ................i Hi Hi Hi The capital of France is', Generated text: ' the'
Prompt: 'Hey Hey..................Hey Your name is', Generated text: ' the'
Prompt: 'Hey Hey Hey ............... Hey Hey The capital of China is', Generated text: ' '
Saved 4 prompts to output.txt
.....
....
Prompt: 'Hi Hi Hi Hi Hi......................Hi Hi Hi Hi Hi Hi Hi Hi Hello, my name is the', Generated text: ' answer: "The answer is: "The answer'
Prompt: 'Hi Hi Hi Hi Hi Hi Hi Hi Hi Hi............. Hi Hi Hi Hi Hi The capital of France is the', Generated text: ' first step. 1. 202'
Prompt: 'Hey Hey Hey Hey ......... Hey Hey Hey Hey Hey Your name is the', Generated text: ' best way to find the best way to find the'
Prompt: 'Hey Hey Hey ............. Hey Hey Hey The capital of China is', Generated text: ' 100000000'

Why is the first token of the generated text from the decode instance different from the first token generated by the prefill instance?

maobaolong avatar Apr 03 '25 14:04 maobaolong

Why is the first token of the generated text from the decode instance different from the first token generated by the prefill instance?

@maobaolong In the example, the prefill instance first generates a new token and "sends" the context plus the newly generated token together to the decode instance. The decoder then starts generating based on that. Therefore, if you compare the "first generated token" on the prefill instance and the decoder instance, they will be different.

Example:

  • prefill instance prompt: Hi Hi Hi Hi Hi H...................Hi Hi Hello, my name is
  • prefill instance generation: the
  • decoder instance prompt: Hi Hi Hi Hi Hi H...................Hi Hi Hello, my name is **the** (here, "the" is the first token generated on the prefill instance)
  • decoder generation: answer: "The answer is: "The answer

Hoping this answers your question.
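
For reference, here is a simplified sketch of the two-stage flow the example scripts implement. The config fields mirror the log output above; the KVTransferConfig import path, the public Qwen/Qwen2.5-1.5B-Instruct model id, and running both stages back to back are assumptions (in the test above they are two separate scripts/processes).

from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig  # import path is an assumption

ktc = KVTransferConfig(
    kv_connector="SharedStorageConnector",
    kv_role="kv_both",
    kv_connector_extra_config={"shared_storage_path": "local_storage"},
)

# --- prefill script: fill the KV cache and generate exactly one token ---
prefill_llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct", kv_transfer_config=ktc, enforce_eager=True)
prompts = ["Hello, my name is", "The capital of France is"]
outs = prefill_llm.generate(prompts, SamplingParams(temperature=0, max_tokens=1))

# The decode instance's prompt is the original prompt plus the first token
# generated above, which is why its "first generated token" differs.
decode_prompts = [p + o.outputs[0].text for p, o in zip(prompts, outs)]

# --- decode script (normally a separate process/node): the connector injects
# the stored KV instead of recomputing it, then generation continues ---
decode_llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct", kv_transfer_config=ktc, enforce_eager=True)
for out in decode_llm.generate(decode_prompts, SamplingParams(temperature=0, max_tokens=10)):
    print(repr(out.outputs[0].text))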

ApostaC avatar Apr 03 '25 17:04 ApostaC

Should we just deprecate V0?

robertgshaw2-redhat avatar Apr 05 '25 00:04 robertgshaw2-redhat

Thanks for this PR 😃. Here are some small proposed changes to restore V0 support (broken by this PR).

@hasB4K Thanks for the catch! I'll update the code soon!

ApostaC avatar Apr 05 '25 02:04 ApostaC

Should we just deprecate V0?

Thanks for bringing this up @robertgshaw2-redhat ! I think we should keep v0 until a performant v1 connector implementation is ready (we are working on that, and it should be ready next week).

Btw, also thanks for the comments. I will address them soon.

ApostaC avatar Apr 05 '25 02:04 ApostaC

@ApostaC I have two questions.

  1. Is it possible for those blocks to be evicted after the scheduler calls get_external_prefix_cache_blocks? If so, how do we handle a computed block that is missing in the worker process?
  2. Should we avoid KV connector access during the profile run?

The following is the call stack during the scheduler's _initialize_kv_caches and the worker's profile_run:


INFO 04-03 17:24:58 [loader.py:447] Loading weights took 0.49 seconds
INFO 04-03 17:24:58 [gpu_model_runner.py:1191] Model loading took 2.8871 GB and 0.574087 seconds
  File "/disc/data1/baoloongmao/v1connector/prefill_example.py", line 17, in <module>
    llm = LLM(model="/disc/data1/Qwen/Qwen2.5-1.5B-Instruct",
  File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 1037, in inner
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/llm.py", line 245, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/llm_engine.py", line 139, in from_engine_args
    return cls(vllm_config=vllm_config,
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/llm_engine.py", line 88, in __init__
    self.engine_core = EngineCoreClient.make_client(
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 68, in make_client
    return InprocClient(vllm_config, executor_class, log_stats)
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 164, in __init__
    self.engine_core = EngineCore(*args, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 63, in __init__
    num_gpu_blocks, num_cpu_blocks = self._initialize_kv_caches(
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 123, in _initialize_kv_caches
    available_gpu_memory = self.model_executor.determine_available_memory()
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 66, in determine_available_memory
    output = self.collective_rpc("determine_available_memory")
  File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
    answer = run_method(self.driver_worker, method, args, kwargs)
  File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2255, in run_method
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 158, in determine_available_memory
    self.model_runner.profile_run()
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1504, in profile_run
    hidden_states = self._dummy_run(self.max_num_tokens)
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1341, in _dummy_run
    hidden_states = model(
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2.py", line 462, in forward
    hidden_states = self.model(input_ids, positions, intermediate_tensors,
  File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 172, in __call__
    return self.forward(*args, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2.py", line 338, in forward
    hidden_states, residual = layer(
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2.py", line 243, in forward
    hidden_states = self.self_attn(
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2.py", line 177, in forward
    attn_output = self.attn(q, k, v)
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/vllm/attention/layer.py", line 184, in forward
    get_kv_transfer_group().wait_for_layer_load(self.layer_name)
  File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/kv_transfer/kv_connector/v1/shared_storage_connector.py", line 175, in wait_for_layer_load
    import traceback; traceback.print_stack()

maobaolong avatar Apr 06 '25 01:04 maobaolong

@hasB4K @robertgshaw2-redhat Hey, I just pushed some new updates to address the review comments. Feel free to take a look and let me know if it does not resolve your concerns. Thanks!

ApostaC avatar Apr 06 '25 19:04 ApostaC

@hasB4K @maobaolong About the memory safety / memory leak issue: currently, the implementation there is pretty hacky. I will spend some time checking whether there could be any problems.

Also, if you have encountered any memory problems, please let me know. That would be very helpful!

ApostaC avatar Apr 06 '25 19:04 ApostaC

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @ApostaC.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify[bot] avatar Apr 07 '25 07:04 mergify[bot]

@yihua I ran DeepSeek-V2-Lite with VLLM_MLA_DISABLE=1 and a Qwen model with this PR; both ran successfully. 👍

The following is our test log.

root@TENCENT64:/disc/data1/baoloongmao/v1connector# bash run.sh 
INFO 04-07 06:05:17 [__init__.py:239] Automatically detected platform cuda.
INFO 04-07 06:05:18 [config.py:208] Replacing legacy 'type' key with 'rope_type'
INFO 04-07 06:05:23 [config.py:591] This model supports multiple tasks: {'score', 'classify', 'reward', 'generate', 'embed'}. Defaulting to 'generate'.
INFO 04-07 06:05:24 [config.py:1712] Chunked prefill is enabled with max_num_batched_tokens=8192.
WARNING 04-07 06:05:24 [cuda.py:95] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 04-07 06:05:24 [core.py:54] Initializing a V1 LLM engine (v0.7.4.dev713+gb4961126) with config: model='/disc/data1/deepseek/DeepSeek-V2-Lite-Chat/', speculative_config=None, tokenizer='/disc/data1/deepseek/DeepSeek-V2-Lite-Chat/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/disc/data1/deepseek/DeepSeek-V2-Lite-Chat/, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}
WARNING 04-07 06:05:25 [utils.py:2321] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f775c2ace30>
INFO 04-07 06:05:26 [parallel_state.py:983] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 04-07 06:05:26 [factory.py:61] Creating v1 connector with name: SharedStorageConnector
INFO 04-07 06:05:26 [shared_storage_connector.py:91] kv_connector='SharedStorageConnector' kv_buffer_device='cuda' kv_buffer_size=1000000000.0 kv_role='kv_both' kv_rank=None kv_parallel_size=1 kv_ip='127.0.0.1' kv_port=14579 kv_connector_extra_config={'shared_storage_path': 'local_storage'}
INFO 04-07 06:05:26 [shared_storage_connector.py:92] Shared storage path is local_storage
INFO 04-07 06:05:26 [cuda.py:220] Using Flash Attention backend on V1 engine.
INFO 04-07 06:05:26 [gpu_model_runner.py:1179] Starting to load model /disc/data1/deepseek/DeepSeek-V2-Lite-Chat/...
INFO 04-07 06:05:26 [topk_topp_sampler.py:53] Using FlashInfer for top-p & top-k sampling.
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:02,  1.23it/s]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:01<00:01,  1.17it/s]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:02<00:00,  1.18it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.36it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.29it/s]

INFO 04-07 06:05:29 [loader.py:447] Loading weights took 3.13 seconds
INFO 04-07 06:05:29 [gpu_model_runner.py:1191] Model loading took 29.3011 GB and 3.279029 seconds
WARNING 04-07 06:05:30 [fused_moe.py:962] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=1408,device_name=NVIDIA_H20.json
WARNING 04-07 06:05:30 [fused_moe.py:962] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=2048,device_name=NVIDIA_H20.json
INFO 04-07 06:05:31 [kv_cache_utils.py:566] GPU KV cache size: 135,264 tokens
INFO 04-07 06:05:31 [kv_cache_utils.py:569] Maximum concurrency for 32,768 tokens per request: 4.13x
INFO 04-07 06:05:32 [core.py:148] init engine (profile, create kv cache, warmup model) took 2.44 seconds
INFO 04-07 06:05:32 [factory.py:61] Creating v1 connector with name: SharedStorageConnector
INFO 04-07 06:05:32 [shared_storage_connector.py:91] kv_connector='SharedStorageConnector' kv_buffer_device='cuda' kv_buffer_size=1000000000.0 kv_role='kv_both' kv_rank=None kv_parallel_size=1 kv_ip='127.0.0.1' kv_port=14579 kv_connector_extra_config={'shared_storage_path': 'local_storage'}
INFO 04-07 06:05:32 [shared_storage_connector.py:92] Shared storage path is local_storage
Processed prompts: 100%|████████████████████████████| 4/4 [00:00<00:00,  4.98it/s, est. speed input: 3763.96 toks/s, output: 4.98 toks/s]
Prompt: 'Hi Hi Hi Hi Hi Hi ............ Hi Hi Hi Hi Hi Hi Hi Hi Hi Hi Hi Hi Hello, my name is', Generated text: ' ['
Prompt: 'Hi Hi Hi Hi Hi Hi ........... Hi Hi Hi Hi Hi Hi Hi Hi Hi Hi Hi Hi The capital of France is', Generated text: ' Paris'
Prompt: 'Hey Hey Hey Hey ............. Hey Hey Hey Hey Your name is', Generated text: ' not'
Prompt: 'Hey Hey Hey Hey ............. Hey Hey Hey The capital of China is', Generated text: ' Beijing'
Saved 4 prompts to output.txt
[rank0]:[W407 06:05:33.694118251 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
INFO 04-07 06:05:36 [__init__.py:239] Automatically detected platform cuda.
Loaded 4 prompts from output.txt
INFO 04-07 06:05:38 [config.py:208] Replacing legacy 'type' key with 'rope_type'
INFO 04-07 06:05:43 [config.py:591] This model supports multiple tasks: {'classify', 'score', 'embed', 'generate', 'reward'}. Defaulting to 'generate'.
INFO 04-07 06:05:43 [config.py:1712] Chunked prefill is enabled with max_num_batched_tokens=8192.
WARNING 04-07 06:05:43 [cuda.py:95] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 04-07 06:05:43 [core.py:54] Initializing a V1 LLM engine (v0.7.4.dev713+gb4961126) with config: model='/disc/data1/deepseek/DeepSeek-V2-Lite-Chat/', speculative_config=None, tokenizer='/disc/data1/deepseek/DeepSeek-V2-Lite-Chat/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/disc/data1/deepseek/DeepSeek-V2-Lite-Chat/, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}
WARNING 04-07 06:05:44 [utils.py:2321] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7fb71895cc50>
INFO 04-07 06:05:45 [parallel_state.py:983] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 04-07 06:05:45 [factory.py:61] Creating v1 connector with name: SharedStorageConnector
INFO 04-07 06:05:45 [shared_storage_connector.py:91] kv_connector='SharedStorageConnector' kv_buffer_device='cuda' kv_buffer_size=1000000000.0 kv_role='kv_both' kv_rank=None kv_parallel_size=1 kv_ip='127.0.0.1' kv_port=14579 kv_connector_extra_config={'shared_storage_path': 'local_storage'}
INFO 04-07 06:05:45 [shared_storage_connector.py:92] Shared storage path is local_storage
INFO 04-07 06:05:45 [cuda.py:220] Using Flash Attention backend on V1 engine.
INFO 04-07 06:05:45 [gpu_model_runner.py:1179] Starting to load model /disc/data1/deepseek/DeepSeek-V2-Lite-Chat/...
INFO 04-07 06:05:45 [topk_topp_sampler.py:53] Using FlashInfer for top-p & top-k sampling.
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:02,  1.25it/s]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:01<00:01,  1.18it/s]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:02<00:00,  1.20it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.38it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.31it/s]

INFO 04-07 06:05:48 [loader.py:447] Loading weights took 3.10 seconds
INFO 04-07 06:05:48 [gpu_model_runner.py:1191] Model loading took 29.3011 GB and 3.221119 seconds
WARNING 04-07 06:05:49 [fused_moe.py:962] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=1408,device_name=NVIDIA_H20.json
WARNING 04-07 06:05:49 [fused_moe.py:962] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=2048,device_name=NVIDIA_H20.json
INFO 04-07 06:05:50 [kv_cache_utils.py:566] GPU KV cache size: 135,264 tokens
INFO 04-07 06:05:50 [kv_cache_utils.py:569] Maximum concurrency for 32,768 tokens per request: 4.13x
INFO 04-07 06:05:51 [core.py:148] init engine (profile, create kv cache, warmup model) took 2.58 seconds
INFO 04-07 06:05:51 [factory.py:61] Creating v1 connector with name: SharedStorageConnector
INFO 04-07 06:05:51 [shared_storage_connector.py:91] kv_connector='SharedStorageConnector' kv_buffer_device='cuda' kv_buffer_size=1000000000.0 kv_role='kv_both' kv_rank=None kv_parallel_size=1 kv_ip='127.0.0.1' kv_port=14579 kv_connector_extra_config={'shared_storage_path': 'local_storage'}
INFO 04-07 06:05:51 [shared_storage_connector.py:92] Shared storage path is local_storage
Processed prompts:   0%|                                       | 0/4 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]INFO 04-07 06:05:51 [shared_storage_connector.py:251] Hit the cache! Allocate new blocks!
INFO 04-07 06:05:51 [shared_storage_connector.py:251] Hit the cache! Allocate new blocks!
INFO 04-07 06:05:51 [shared_storage_connector.py:251] Hit the cache! Allocate new blocks!
INFO 04-07 06:05:51 [shared_storage_connector.py:251] Hit the cache! Allocate new blocks!
INFO 04-07 06:05:51 [shared_storage_connector.py:152] Inject KV cache of 992 tokens to the paged memory
INFO 04-07 06:05:51 [shared_storage_connector.py:152] Inject KV cache of 496 tokens to the paged memory
INFO 04-07 06:05:51 [shared_storage_connector.py:152] Inject KV cache of 496 tokens to the paged memory
Processed prompts: 100%|███████████████████████████| 4/4 [00:00<00:00,  9.71it/s, est. speed input: 7347.90 toks/s, output: 84.99 toks/s]
Prompt: 'Hi Hi Hi Hi Hi........Hi Hi Hi Hi Hi Hi Hi Hi Hi Hi Hi Hi Hi Hi Hi Hi Hi Hi Hi Hello, my name is [', Generated text: 'Your Name] and I am a [Your Job'
Prompt: 'Hi Hi Hi Hi.......... Hi Hi Hi Hi Hi Hi Hi The capital of France is Paris', Generated text: '. The city of Paris is located along the Seine'
Prompt: 'Hey Hey Hey ........ Hey Hey Your name is not', Generated text: ' on the list.'
Prompt: 'Hey Hey Hey .........Hey The capital of China is Beijing', Generated text: '.\n\nThe capital of China is Beijing.'
[rank0]:[W407 06:05:52.480797748 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())

maobaolong avatar Apr 07 '25 13:04 maobaolong

LGTM.

robertgshaw2-redhat avatar Apr 08 '25 03:04 robertgshaw2-redhat

@ApostaC Do you think it would be feasible to include a simple online server example in this PR to demonstrate how an orchestrator would interact with the KVConnector?

VertexC avatar Apr 09 '25 08:04 VertexC

@ApostaC Do you think it would be feasible to include a simple online server example in this PR to demonstrate how an orchestrator would interact with the KVConnector?

@VertexC Yeah, it would be pretty helpful. I think it could probably be in a separate PR, as this one is already pretty large.

ApostaC avatar Apr 09 '25 17:04 ApostaC

Hello! Since it wasn't too much work, I added MLA support in this PR. Hope this helps; the changes are pretty light (mainly an if-MLA branch with the correct shape indices), so I hope it doesn't slow down this PR.

Flechman avatar Apr 10 '25 19:04 Flechman

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @ApostaC.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify[bot] avatar Apr 13 '25 02:04 mergify[bot]

[image: assertion failure in block_pool.py]

@robertgshaw2-redhat Thanks for the update today. Just syncing with you: I've tested the latest state (80 commits), and there is an assertion failure in block_pool.py (screenshot above).

maobaolong avatar Apr 14 '25 12:04 maobaolong

@robertgshaw2-redhat Thanks for the update today. Just syncing with you: I've tested the latest state (`80 commits`), and there is an `assertion failed` in block_pool.py.

Thanks, I am aware

robertgshaw2-redhat avatar Apr 14 '25 12:04 robertgshaw2-redhat

@robertgshaw2-redhat I guess the reason is here: [image]

maobaolong avatar Apr 14 '25 12:04 maobaolong

@robertgshaw2-redhat I guess the reason is here

This should be resolved now

robertgshaw2-redhat avatar Apr 15 '25 03:04 robertgshaw2-redhat

There is one small edge case left:

Specifically, the case where:

  • a request is preempted
  • the request is rescheduled and gets a remote cache hit during recomputation
  • the connector calls from_request(request: NewRequestData) when building the metadata; however, in this case we do not have NewRequestData, we have CachedRequestData, which is a type mismatch (see the sketch below)
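
A minimal illustration of the mismatch (the dataclass fields below are illustrative stand-ins, not the actual definitions):

from dataclasses import dataclass, field

@dataclass
class NewRequestData:     # request scheduled for the first time
    req_id: str
    prompt_token_ids: list = field(default_factory=list)  # illustrative field

@dataclass
class CachedRequestData:  # request resumed after preemption
    req_id: str
    num_computed_tokens: int = 0                           # illustrative field

def from_request(request: NewRequestData) -> dict:
    # The metadata builder assumes a NewRequestData...
    return {"req_id": request.req_id, "tokens": request.prompt_token_ids}

# ...but a preempted-then-rescheduled request with a remote cache hit arrives
# as CachedRequestData, so the access below fails:
try:
    from_request(CachedRequestData(req_id="req-7"))
except AttributeError as e:
    print("type mismatch:", e)  # CachedRequestData has no prompt_token_ids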

Have a test for this case. Will fix in the AM

@WoosukKwon - otherwise this should be good

robertgshaw2-redhat avatar Apr 15 '25 05:04 robertgshaw2-redhat

@robertgshaw2-redhat As discussed offline, I'm ok with merging this PR. However, I'd like to defer any other followup PRs (such as #16625) until we land the hybrid memory allocator, since there will be substantial changes in the KV cache manager.

WoosukKwon avatar Apr 16 '25 06:04 WoosukKwon

May I ask when the branch will be merged?

sunshenao avatar Apr 17 '25 05:04 sunshenao

May I ask when the branch will be merged?

Just getting tests green.

robertgshaw2-redhat avatar Apr 17 '25 15:04 robertgshaw2-redhat

But is there a demo to run? Can I run it like this?

VLLM_USE_V1=1 python3 -m vllm.entrypoints.openai.api_server --model xxx --port 8500 --max-model-len 8192 --gpu-memory-utilization 0.9 --kv-transfer-config '{"kv_connector":"LMCacheConnector","kv_role":"kv_consumer"}' --trust-remote-code --tensor-parallel-size 8

Huixxi avatar Apr 23 '25 12:04 Huixxi

But is there a demo to run? Can I run it like this?

@Huixxi There is an example in #16625

ApostaC avatar Apr 23 '25 17:04 ApostaC

But is there a demo to run? Can I run it like this?

@Huixxi There is an example in #16625

Thanks! Which source branch should I use? This one: https://github.com/ApostaC/vllm/tree/local-dev/lmcache-v1-connector-pr? And does it support xPyD now? Multiple nodes? And which version of LMCache should I install, and how?

Huixxi avatar Apr 24 '25 05:04 Huixxi

How do I run this with vLLM serve? python3 -m vllm.entrypoints.openai.api_server --model /extended/downloaded/Meta-Llama-3.1-70B-Instruct-quantized.w8a8/ --kv-transfer-config '{"kv_connector":"LMCacheConnector", "kv_role":"kv_both"}' --max-model-len 4096 --gpu-memory-utilization 0.9 gives an error:

llm.v1.worker.gpu_worker.Worker object at 0xf17daec353d0>
INFO 04-28 18:01:26 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
ERROR 04-28 18:01:26 [core.py:396] EngineCore failed to start.
ERROR 04-28 18:01:26 [core.py:396] Traceback (most recent call last):
ERROR 04-28 18:01:26 [core.py:396]   File "/workspace/vllm/vllm/v1/engine/core.py", line 387, in run_engine_core
ERROR 04-28 18:01:26 [core.py:396]     engine_core = EngineCoreProc(*args, **kwargs)
ERROR 04-28 18:01:26 [core.py:396]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-28 18:01:26 [core.py:396]   File "/workspace/vllm/vllm/v1/engine/core.py", line 329, in __init__
ERROR 04-28 18:01:26 [core.py:396]     super().__init__(vllm_config, executor_class, log_stats,
ERROR 04-28 18:01:26 [core.py:396]   File "/workspace/vllm/vllm/v1/engine/core.py", line 64, in __init__
ERROR 04-28 18:01:26 [core.py:396]     self.model_executor = executor_class(vllm_config)
ERROR 04-28 18:01:26 [core.py:396]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-28 18:01:26 [core.py:396]   File "/workspace/vllm/vllm/executor/executor_base.py", line 52, in __init__
ERROR 04-28 18:01:26 [core.py:396]     self._init_executor()
ERROR 04-28 18:01:26 [core.py:396]   File "/workspace/vllm/vllm/executor/uniproc_executor.py", line 46, in _init_executor
ERROR 04-28 18:01:26 [core.py:396]     self.collective_rpc("init_device")
ERROR 04-28 18:01:26 [core.py:396]   File "/workspace/vllm/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
ERROR 04-28 18:01:26 [core.py:396]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 04-28 18:01:26 [core.py:396]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-28 18:01:26 [core.py:396]   File "/workspace/vllm/vllm/utils.py", line 2456, in run_method
ERROR 04-28 18:01:26 [core.py:396]     return func(*args, **kwargs)
ERROR 04-28 18:01:26 [core.py:396]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-28 18:01:26 [core.py:396]   File "/workspace/vllm/vllm/worker/worker_base.py", line 604, in init_device
ERROR 04-28 18:01:26 [core.py:396]     self.worker.init_device()  # type: ignore
ERROR 04-28 18:01:26 [core.py:396]     ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-28 18:01:26 [core.py:396]   File "/workspace/vllm/vllm/v1/worker/gpu_worker.py", line 135, in init_device
ERROR 04-28 18:01:26 [core.py:396]     init_worker_distributed_environment(self.vllm_config, self.rank,
ERROR 04-28 18:01:26 [core.py:396]   File "/workspace/vllm/vllm/v1/worker/gpu_worker.py", line 329, in init_worker_distributed_environment
ERROR 04-28 18:01:26 [core.py:396]     ensure_kv_transfer_initialized(vllm_config)
ERROR 04-28 18:01:26 [core.py:396]   File "/workspace/vllm/vllm/distributed/kv_transfer/kv_transfer_state.py", line 63, in ensure_kv_transfer_initialized
ERROR 04-28 18:01:26 [core.py:396]     _KV_CONNECTOR_AGENT = KVConnectorFactory.create_connector_v1(
ERROR 04-28 18:01:26 [core.py:396]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-28 18:01:26 [core.py:396]   File "/workspace/vllm/vllm/distributed/kv_transfer/kv_connector/factory.py", line 63, in create_connector_v1
ERROR 04-28 18:01:26 [core.py:396]     assert issubclass(connector_cls, KVConnectorBase_V1)
ERROR 04-28 18:01:26 [core.py:396]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-28 18:01:26 [core.py:396] AssertionError
Process EngineCore_0:
Traceback (most recent call last):
  File "/root/.local/share/uv/python/cpython-3.12.9-linux-aarch64-gnu/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/root/.local/share/uv/python/cpython-3.12.9-linux-aarch64-gnu/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/workspace/vllm/vllm/v1/engine/core.py", line 400, in run_engine_core
    raise e
  File "/workspace/vllm/vllm/v1/engine/core.py", line 387, in run_engine_core
    engine_core = EngineCoreProc(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/vllm/vllm/v1/engine/core.py", line 329, in __init__
    super().__init__(vllm_config, executor_class, log_stats,
  File "/workspace/vllm/vllm/v1/engine/core.py", line 64, in __init__
    self.model_executor = executor_class(vllm_config)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/vllm/vllm/executor/executor_base.py", line 52, in __init__
    self._init_executor()
  File "/workspace/vllm/vllm/executor/uniproc_executor.py", line 46, in _init_executor
    self.collective_rpc("init_device")
  File "/workspace/vllm/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
    answer = run_method(self.driver_worker, method, args, kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/vllm/vllm/utils.py", line 2456, in run_method
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/vllm/vllm/worker/worker_base.py", line 604, in init_device
    self.worker.init_device()  # type: ignore
    ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/vllm/vllm/v1/worker/gpu_worker.py", line 135, in init_device
    init_worker_distributed_environment(self.vllm_config, self.rank,
  File "/workspace/vllm/vllm/v1/worker/gpu_worker.py", line 329, in init_worker_distributed_environment
    ensure_kv_transfer_initialized(vllm_config)
  File "/workspace/vllm/vllm/distributed/kv_transfer/kv_transfer_state.py", line 63, in ensure_kv_transfer_initialized
    _KV_CONNECTOR_AGENT = KVConnectorFactory.create_connector_v1(
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/vllm/vllm/distributed/kv_transfer/kv_connector/factory.py", line 63, in create_connector_v1
    assert issubclass(connector_cls, KVConnectorBase_V1)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError
[rank0]:[W428 18:01:26.726953847 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/workspace/vllm/vllm/entrypoints/openai/api_server.py", line 1130, in <module>
    uvloop.run(run_server(args))
  File "/workspace/.venv/lib/python3.12/site-packages/uvloop/__init__.py", line 109, in run
    return __asyncio.run(
           ^^^^^^^^^^^^^^
  File "/root/.local/share/uv/python/cpython-3.12.9-linux-aarch64-gnu/lib/python3.12/asyncio/runners.py", line 195, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/root/.local/share/uv/python/cpython-3.12.9-linux-aarch64-gnu/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/workspace/.venv/lib/python3.12/site-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
           ^^^^^^^^^^
  File "/workspace/vllm/vllm/entrypoints/openai/api_server.py", line 1078, in run_server
    async with build_async_engine_client(args) as engine_client:
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/share/uv/python/cpython-3.12.9-linux-aarch64-gnu/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/vllm/vllm/entrypoints/openai/api_server.py", line 146, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/share/uv/python/cpython-3.12.9-linux-aarch64-gnu/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/vllm/vllm/entrypoints/openai/api_server.py", line 178, in build_async_engine_client_from_engine_args
    async_llm = AsyncLLM.from_vllm_config(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/vllm/vllm/v1/engine/async_llm.py", line 150, in from_vllm_config
    return cls(
           ^^^^
  File "/workspace/vllm/vllm/v1/engine/async_llm.py", line 118, in __init__
    self.engine_core = core_client_class(
                       ^^^^^^^^^^^^^^^^^^
  File "/workspace/vllm/vllm/v1/engine/core_client.py", line 642, in __init__
    super().__init__(
  File "/workspace/vllm/vllm/v1/engine/core_client.py", line 398, in __init__
    self._wait_for_engine_startup()
  File "/workspace/vllm/vllm/v1/engine/core_client.py", line 430, in _wait_for_engine_startup
    raise RuntimeError("Engine core initialization failed. "
RuntimeError: Engine core initialization failed. See root cause above.

khayamgondal avatar Apr 28 '25 18:04 khayamgondal