[P/D][V1] KV Connector API V1
TL;DR:
This PR introduces the KV connector API in v1 to support disaggregated prefill. It also includes a minimal functional implementation as an example of how to use the connector API.
Detailed design doc: https://docs.google.com/document/d/1uPGdbEXksKXeN4Q9nUm9hzotqEjQhYmnpAhidLuAsjk
This PR is co-authored by:
- KuntaiDu [email protected]
- YaoJiayi [email protected]
TODOs in the upcoming PRs
- [ ] More performant connector implementation using P2P connections
- [ ] MLA support
- [ ] Enable the KVCacheManager to allocate temporary blocks for the connector to use
Key design choices
- Implement disagg prefill under the hood of v1's prefix caching and chunked prefill semantics: the vLLM scheduler calculates which set of tokens needs a KV store or KV load, and the workers perform the actual KV store or load operations.
- Provide layer-wise async API support
- KV cache prefetching and request orchestration should happen outside vLLM so that the changes in the core can be minimized
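To make this concrete, enabling the connector through the offline API presumably looks roughly like the sketch below. This is a minimal sketch based on the SharedStorageConnector config echoed in the logs later in this thread; the exact import path of `KVTransferConfig` and the `kv_transfer_config` argument name are assumptions, not a verified interface.

```python
# Minimal sketch (assumptions noted above): run one instance that both saves
# and loads KV cache via the example SharedStorageConnector.
from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig  # import path assumed

ktc = KVTransferConfig(
    kv_connector="SharedStorageConnector",  # the v1 connector added by this PR
    kv_role="kv_both",                      # save on prefill, load on decode
    kv_connector_extra_config={"shared_storage_path": "local_storage"},
)

# VLLM_USE_V1=1 may also need to be set in the environment at this point.
llm = LLM(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    enforce_eager=True,
    kv_transfer_config=ktc,
)
outputs = llm.generate(["The capital of France is"], SamplingParams(max_tokens=10))
print(outputs[0].outputs[0].text)
```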
High-level design of the KV connector in v1
The figure below shows the high-level design of the connector.
In this design, every process in vLLM has a corresponding connector. Specifically, we have:
- Scheduler connector: the connector that lives in the same process as the scheduler. It schedules the KV cache transfer ops.
- Worker connectors: the connectors that live in the worker processes. They execute the KV cache transfer ops.
Scheduler connector
On prefill nodes, the scheduler connector needs to parse the scheduler's output and determine what tokens should have their KV cache transmitted to the decoder nodes.
On decoder nodes, the scheduler connector needs to return the "correct" num_computed_tokens and computed_blocks when the scheduler calls get_computed_tokens.
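Roughly, the decode-side flow looks like the pseudo-flow below. This is an illustrative sketch only; `get_num_external_hit_tokens` and the surrounding structure are assumed names for illustration, not the actual scheduler code.

```python
# Illustrative pseudo-flow (names assumed): on a decode node, the scheduler
# consults the connector so externally available KV is treated like a local
# prefix-cache hit and is not recomputed.
def get_computed_tokens_with_connector(kv_cache_manager, connector, request):
    # Tokens already covered by the local prefix cache.
    computed_blocks, num_local_hit_tokens = kv_cache_manager.get_computed_blocks(request)
    # Extra tokens whose KV the connector can load (e.g. from the prefill node
    # or from shared storage); the connector decides this on the scheduler side.
    num_external_hit_tokens = connector.get_num_external_hit_tokens(
        request, num_local_hit_tokens)
    # Only the remaining suffix of the prompt is scheduled for computation.
    num_computed_tokens = num_local_hit_tokens + num_external_hit_tokens
    return computed_blocks, num_computed_tokens
```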
Worker connector
The figure below shows how the worker connector works with the attention module to achieve layer-by-layer KV cache store and load:
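A simplified sketch of that interleaving is below. Only `wait_for_layer_load` appears verbatim in a traceback later in this thread (it is called from vllm/attention/layer.py); the other hook names are assumptions used for illustration.

```python
# Simplified sketch (hook names other than wait_for_layer_load are assumed):
# the worker connector overlaps KV cache transfer with per-layer compute.
def forward_with_connector(layers, kv_caches, connector, hidden_states, fwd_ctx):
    # Kick off asynchronous loading of externally cached KV for this batch.
    connector.start_load_kv(fwd_ctx)
    for layer_name, layer in layers.items():
        # Block until this layer's KV has been injected into paged memory,
        # so attention for this layer sees the loaded entries.
        connector.wait_for_layer_load(layer_name)
        hidden_states = layer(hidden_states, kv_caches[layer_name])
        # Asynchronously save this layer's freshly written KV (prefill side),
        # overlapping the transfer with the next layer's compute.
        connector.save_kv_layer(layer_name, kv_caches[layer_name])
    # Make sure all pending saves have finished before the step completes.
    connector.wait_for_save()
    return hidden_states
```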
Working with an outside orchestrator
In more advanced use cases like xPyD, the connector may need to learn from the outside orchestrator which decoder node to send the KV cache to. We believe different infrastructure providers may have very different orchestration logic, so such logic should reside outside of vLLM.
The figure below explains the workflow among the orchestrator, vLLM, and the connector:
At a high level, the orchestrator should determine when to send the request to which node. Also, the connector may give the orchestrator some feedback, such as "KV cache transfer finished" (depending on the implementation).
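As a rough illustration of that workflow, a 1P1D orchestrator could look like the sketch below. The endpoints, model name, and single-proxy structure are hypothetical; real orchestrators will differ per infrastructure, and the transfer-finished feedback depends on the connector implementation.

```python
# Hedged sketch of an external 1P1D orchestrator (endpoints and model name are
# hypothetical). The prefill node generates one token and stores the KV cache;
# the decode node loads that KV and continues generation.
import requests

PREFILL_URL = "http://prefill-node:8000/v1/completions"  # hypothetical
DECODE_URL = "http://decode-node:8001/v1/completions"    # hypothetical

def serve(prompt: str, max_tokens: int) -> str:
    # 1. Prefill (plus one generated token) on the prefill node; its worker
    #    connector saves the KV cache for the decode node to load.
    prefill = requests.post(PREFILL_URL, json={
        "model": "my-model", "prompt": prompt, "max_tokens": 1,
    }).json()
    first_token = prefill["choices"][0]["text"]

    # 2. Optionally wait here for the connector's "KV cache transfer finished"
    #    signal, depending on the connector implementation.

    # 3. Send prompt + first token to the decode node; its scheduler connector
    #    reports those tokens as already computed, so no prefill is re-run.
    decode = requests.post(DECODE_URL, json={
        "model": "my-model", "prompt": prompt + first_token,
        "max_tokens": max_tokens,
    }).json()
    return first_token + decode["choices"][0]["text"]
```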
For more details, please refer to our design doc: https://docs.google.com/document/d/1uPGdbEXksKXeN4Q9nUm9hzotqEjQhYmnpAhidLuAsjk
Extra note
- This PR's goal is to ship the connector API with just a minimal functional implementation. We are working on a better (more performant, more stable) implementation, which will come in a new PR soon.
👋 Hi! Thank you for contributing to the vLLM project.
💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.
Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.
Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.
🚀
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @ApostaC.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
cc @KuntaiDu @YaoJiayi
@ApostaC I cherry-picked this PR into our repo and ran the example. The following is my log:
INFO 04-03 07:33:59 [factory.py:61] Creating v1 connector with name: SharedStorageConnector
INFO 04-03 07:33:59 [shared_storage_connector.py:91] kv_connector='SharedStorageConnector' kv_buffer_device='cuda' kv_buffer_size=1000000000.0 kv_role='kv_both' kv_rank=None kv_parallel_size=1 kv_ip='127.0.0.1' kv_port=14579 kv_connector_extra_config={}
INFO 04-03 07:33:59 [shared_storage_connector.py:92] Shared storage path is /tmp
Processed prompts: 0%| | 0/4 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]INFO 04-03 07:34:00 [shared_storage_connector.py:250] Hit the cache! Allocate new blocks!
INFO 04-03 07:34:00 [shared_storage_connector.py:250] Hit the cache! Allocate new blocks!
INFO 04-03 07:34:00 [shared_storage_connector.py:250] Hit the cache! Allocate new blocks!
INFO 04-03 07:34:00 [shared_storage_connector.py:250] Hit the cache! Allocate new blocks!
INFO 04-03 07:34:00 [shared_storage_connector.py:131] Start loading KV cache from the connector
Processed prompts: 100%|██████████████████████████| 4/4 [00:00<00:00, 50.53it/s, est. speed input: 38147.85 toks/s, output: 50.56 toks/s]
Prompt: 'Hi Hi Hi Hi Hi H...................Hi Hi Hello, my name is', Generated text: ' the'
Prompt: 'Hi Hi Hi Hi Hi ................i Hi Hi Hi The capital of France is', Generated text: ' the'
Prompt: 'Hey Hey..................Hey Your name is', Generated text: ' the'
Prompt: 'Hey Hey Hey ............... Hey Hey The capital of China is', Generated text: ' '
Saved 4 prompts to output.txt
[rank0]:[W403 07:34:00.128094287 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
INFO 04-03 07:34:03 [__init__.py:239] Automatically detected platform cuda.
Loaded 4 prompts from output.txt
INFO 04-03 07:34:09 [config.py:591] This model supports multiple tasks: {'embed', 'generate', 'score', 'classify', 'reward'}. Defaulting to 'generate'.
INFO 04-03 07:34:10 [config.py:1712] Chunked prefill is enabled with max_num_batched_tokens=8192.
WARNING 04-03 07:34:10 [cuda.py:95] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 04-03 07:34:10 [core.py:54] Initializing a V1 LLM engine (v0.7.4.dev711+gee96432c) with config: model='/disc/data1/Qwen/Qwen2.5-1.5B-Instruct', speculative_config=None, tokenizer='/disc/data1/Qwen/Qwen2.5-1.5B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/disc/data1/Qwen/Qwen2.5-1.5B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}
WARNING 04-03 07:34:11 [utils.py:2321] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7ffe6832ea80>
INFO 04-03 07:34:11 [parallel_state.py:959] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 04-03 07:34:11 [factory.py:61] Creating v1 connector with name: SharedStorageConnector
INFO 04-03 07:34:11 [shared_storage_connector.py:91] kv_connector='SharedStorageConnector' kv_buffer_device='cuda' kv_buffer_size=1000000000.0 kv_role='kv_both' kv_rank=None kv_parallel_size=1 kv_ip='127.0.0.1' kv_port=14579 kv_connector_extra_config={'shared_storage_path': 'local_storage'}
INFO 04-03 07:34:11 [shared_storage_connector.py:92] Shared storage path is local_storage
INFO 04-03 07:34:11 [cuda.py:220] Using Flash Attention backend on V1 engine.
INFO 04-03 07:34:11 [gpu_model_runner.py:1179] Starting to load model /disc/data1/Qwen/Qwen2.5-1.5B-Instruct...
INFO 04-03 07:34:11 [topk_topp_sampler.py:53] Using FlashInfer for top-p & top-k sampling.
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 2.10it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 2.10it/s]
INFO 04-03 07:34:11 [loader.py:447] Loading weights took 0.50 seconds
INFO 04-03 07:34:12 [gpu_model_runner.py:1191] Model loading took 2.8871 GB and 0.585285 seconds
INFO 04-03 07:34:12 [kv_cache_utils.py:566] GPU KV cache size: 2,644,896 tokens
INFO 04-03 07:34:12 [kv_cache_utils.py:569] Maximum concurrency for 32,768 tokens per request: 80.72x
INFO 04-03 07:34:12 [core.py:152] init engine (profile, create kv cache, warmup model) took 0.88 seconds
INFO 04-03 07:34:12 [factory.py:61] Creating v1 connector with name: SharedStorageConnector
INFO 04-03 07:34:12 [shared_storage_connector.py:91] kv_connector='SharedStorageConnector' kv_buffer_device='cuda' kv_buffer_size=1000000000.0 kv_role='kv_both' kv_rank=None kv_parallel_size=1 kv_ip='127.0.0.1' kv_port=14579 kv_connector_extra_config={'shared_storage_path': 'local_storage'}
INFO 04-03 07:34:12 [shared_storage_connector.py:92] Shared storage path is local_storage
Processed prompts: 0%| | 0/4 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]INFO 04-03 07:34:13 [shared_storage_connector.py:131] Start loading KV cache from the connector
INFO 04-03 07:34:13 [shared_storage_connector.py:131] Start loading KV cache from the connector
INFO 04-03 07:34:13 [shared_storage_connector.py:131] Start loading KV cache from the connector
INFO 04-03 07:34:13 [shared_storage_connector.py:131] Start loading KV cache from the connector
INFO 04-03 07:34:13 [shared_storage_connector.py:131] Start loading KV cache from the connector
INFO 04-03 07:34:13 [shared_storage_connector.py:131] Start loading KV cache from the connector
INFO 04-03 07:34:13 [shared_storage_connector.py:131] Start loading KV cache from the connector
INFO 04-03 07:34:13 [shared_storage_connector.py:131] Start loading KV cache from the connector
INFO 04-03 07:34:13 [shared_storage_connector.py:131] Start loading KV cache from the connector
INFO 04-03 07:34:13 [shared_storage_connector.py:131] Start loading KV cache from the connector
Processed prompts: 100%|█████████████████████████| 4/4 [00:00<00:00, 16.85it/s, est. speed input: 12724.93 toks/s, output: 168.48 toks/s]
Prompt: 'Hi Hi Hi Hi Hi......................Hi Hi Hi Hi Hi Hi Hi Hi Hello, my name is the', Generated text: ' answer: "The answer is: "The answer'
Prompt: 'Hi Hi Hi Hi Hi Hi Hi Hi Hi Hi............. Hi Hi Hi Hi Hi The capital of France is the', Generated text: ' first step. 1. 202'
Prompt: 'Hey Hey Hey Hey ......... Hey Hey Hey Hey Hey Your name is the', Generated text: ' best way to find the best way to find the'
Prompt: 'Hey Hey Hey ............. Hey Hey Hey The capital of China is', Generated text: ' 100000000'
[rank0]:[W403 07:34:13.321916884 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
Here are the key parts of the log:
Prompt: 'Hi Hi Hi Hi Hi H...................Hi Hi Hello, my name is', Generated text: ' the'
Prompt: 'Hi Hi Hi Hi Hi ................i Hi Hi Hi The capital of France is', Generated text: ' the'
Prompt: 'Hey Hey..................Hey Your name is', Generated text: ' the'
Prompt: 'Hey Hey Hey ............... Hey Hey The capital of China is', Generated text: ' '
Saved 4 prompts to output.txt
.....
....
Prompt: 'Hi Hi Hi Hi Hi......................Hi Hi Hi Hi Hi Hi Hi Hi Hello, my name is the', Generated text: ' answer: "The answer is: "The answer'
Prompt: 'Hi Hi Hi Hi Hi Hi Hi Hi Hi Hi............. Hi Hi Hi Hi Hi The capital of France is the', Generated text: ' first step. 1. 202'
Prompt: 'Hey Hey Hey Hey ......... Hey Hey Hey Hey Hey Your name is the', Generated text: ' best way to find the best way to find the'
Prompt: 'Hey Hey Hey ............. Hey Hey Hey The capital of China is', Generated text: ' 100000000'
Why is the first token of the generated text from the decode instance different from the first token generated by the prefill instance?
> Why is the first token of the generated text from the decode instance different from the first token generated by the prefill instance?
@maobaolong In the example, the prefill instance first generates a new token and "sends" the context + the newly generated token together to the decode instance. The decoder then starts generating based on that. Therefore, if you look at the "first generated token" on prefill instance and decoder instance, they should be different.
Example:
- prefill instance prompt: Hi Hi Hi Hi Hi H...................Hi Hi Hello, my name is
- prefill instance generation: the
- decoder instance prompt: Hi Hi Hi Hi Hi H...................Hi Hi Hello, my name is **the** (here, "the" is the first token generated on the prefill instance)
- decoder generation: answer: "The answer is: "The answer
Hoping this answers your question.
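For reference, the two example scripts presumably do something like the sketch below (the output.txt name matches the logs, but the structure and field access here are illustrative, not the scripts' actual code):

```python
# Sketch of the prefill -> decode handoff implied by the example logs.
# Both LLM instances are assumed to be configured with the SharedStorageConnector.
from vllm import LLM, SamplingParams

def prefill_stage(llm: LLM, prompts: list) -> None:
    # Generate exactly one token per prompt; the worker connector saves the
    # KV cache to shared storage as a side effect of this forward pass.
    outputs = llm.generate(prompts, SamplingParams(max_tokens=1))
    with open("output.txt", "w") as f:
        for out in outputs:
            f.write(out.prompt + out.outputs[0].text + "\n")

def decode_stage(llm: LLM) -> None:
    # Read back prompt + first generated token; the scheduler connector marks
    # those tokens as externally cached, so the decode instance loads the KV
    # from shared storage instead of recomputing the prefill.
    with open("output.txt") as f:
        prompts = [line.rstrip("\n") for line in f]
    for out in llm.generate(prompts, SamplingParams(max_tokens=10)):
        print(out.prompt, "->", out.outputs[0].text)
```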
Should we just deprecate V0?
Thanks for this PR 😃. Here are some small proposed changes to restore support for V0 (broken by this PR).
@hasB4K Thanks for the catch! I'll update the code soon!
> Should we just deprecate V0?
Thanks for bringing this up @robertgshaw2-redhat! I think we should still keep v0 until a performant v1 connector implementation is ready (we are working on that, and it should be ready next week).
Btw, also thanks for the comments. I will address them soon.
@ApostaC There are two questions from my side.
- Can those blocks be evicted after the scheduler calls get_external_prefix_cache_blocks? If so, how should we handle computed blocks that are missing in the worker process?
- Should we avoid KV connector access during the profile run?
The following is the call stack while the scheduler runs _initialize_kv_caches and the worker runs profile_run:
INFO 04-03 17:24:58 [loader.py:447] Loading weights took 0.49 seconds
INFO 04-03 17:24:58 [gpu_model_runner.py:1191] Model loading took 2.8871 GB and 0.574087 seconds
File "/disc/data1/baoloongmao/v1connector/prefill_example.py", line 17, in <module>
llm = LLM(model="/disc/data1/Qwen/Qwen2.5-1.5B-Instruct",
File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 1037, in inner
return fn(*args, **kwargs)
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/llm.py", line 245, in __init__
self.llm_engine = LLMEngine.from_engine_args(
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/llm_engine.py", line 139, in from_engine_args
return cls(vllm_config=vllm_config,
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/llm_engine.py", line 88, in __init__
self.engine_core = EngineCoreClient.make_client(
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 68, in make_client
return InprocClient(vllm_config, executor_class, log_stats)
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 164, in __init__
self.engine_core = EngineCore(*args, **kwargs)
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 63, in __init__
num_gpu_blocks, num_cpu_blocks = self._initialize_kv_caches(
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 123, in _initialize_kv_caches
available_gpu_memory = self.model_executor.determine_available_memory()
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 66, in determine_available_memory
output = self.collective_rpc("determine_available_memory")
File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
answer = run_method(self.driver_worker, method, args, kwargs)
File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2255, in run_method
return func(*args, **kwargs)
File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 158, in determine_available_memory
self.model_runner.profile_run()
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1504, in profile_run
hidden_states = self._dummy_run(self.max_num_tokens)
File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1341, in _dummy_run
hidden_states = model(
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2.py", line 462, in forward
hidden_states = self.model(input_ids, positions, intermediate_tensors,
File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 172, in __call__
return self.forward(*args, **kwargs)
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2.py", line 338, in forward
hidden_states, residual = layer(
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2.py", line 243, in forward
hidden_states = self.self_attn(
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2.py", line 177, in forward
attn_output = self.attn(q, k, v)
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.12/dist-packages/vllm/attention/layer.py", line 184, in forward
get_kv_transfer_group().wait_for_layer_load(self.layer_name)
File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/kv_transfer/kv_connector/v1/shared_storage_connector.py", line 175, in wait_for_layer_load
import traceback; traceback.print_stack()
@hasB4K @robertgshaw2-redhat Hey, I just pushed some new updates to address the review comments. Feel free to take a look and let me know if it does not resolve your concerns. Thanks!
@hasB4K @maobaolong About the memory safety / memory leaking issue: currently, the implementation around this is pretty hacky. I will spend some time checking whether there could be any problems.
Also, if you have encountered any memory problems, please let me know. That would be very helpful!
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @ApostaC.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
@yihua I ran DeepSeek-V2-Lite (with VLLM_MLA_DISABLE=1) and a Qwen model with this PR; both ran successfully. 👍
The following is our test log:
root@TENCENT64:/disc/data1/baoloongmao/v1connector# bash run.sh
INFO 04-07 06:05:17 [__init__.py:239] Automatically detected platform cuda.
INFO 04-07 06:05:18 [config.py:208] Replacing legacy 'type' key with 'rope_type'
INFO 04-07 06:05:23 [config.py:591] This model supports multiple tasks: {'score', 'classify', 'reward', 'generate', 'embed'}. Defaulting to 'generate'.
INFO 04-07 06:05:24 [config.py:1712] Chunked prefill is enabled with max_num_batched_tokens=8192.
WARNING 04-07 06:05:24 [cuda.py:95] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 04-07 06:05:24 [core.py:54] Initializing a V1 LLM engine (v0.7.4.dev713+gb4961126) with config: model='/disc/data1/deepseek/DeepSeek-V2-Lite-Chat/', speculative_config=None, tokenizer='/disc/data1/deepseek/DeepSeek-V2-Lite-Chat/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/disc/data1/deepseek/DeepSeek-V2-Lite-Chat/, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}
WARNING 04-07 06:05:25 [utils.py:2321] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f775c2ace30>
INFO 04-07 06:05:26 [parallel_state.py:983] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 04-07 06:05:26 [factory.py:61] Creating v1 connector with name: SharedStorageConnector
INFO 04-07 06:05:26 [shared_storage_connector.py:91] kv_connector='SharedStorageConnector' kv_buffer_device='cuda' kv_buffer_size=1000000000.0 kv_role='kv_both' kv_rank=None kv_parallel_size=1 kv_ip='127.0.0.1' kv_port=14579 kv_connector_extra_config={'shared_storage_path': 'local_storage'}
INFO 04-07 06:05:26 [shared_storage_connector.py:92] Shared storage path is local_storage
INFO 04-07 06:05:26 [cuda.py:220] Using Flash Attention backend on V1 engine.
INFO 04-07 06:05:26 [gpu_model_runner.py:1179] Starting to load model /disc/data1/deepseek/DeepSeek-V2-Lite-Chat/...
INFO 04-07 06:05:26 [topk_topp_sampler.py:53] Using FlashInfer for top-p & top-k sampling.
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:00<00:02, 1.23it/s]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:01<00:01, 1.17it/s]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:02<00:00, 1.18it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00, 1.36it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00, 1.29it/s]
INFO 04-07 06:05:29 [loader.py:447] Loading weights took 3.13 seconds
INFO 04-07 06:05:29 [gpu_model_runner.py:1191] Model loading took 29.3011 GB and 3.279029 seconds
WARNING 04-07 06:05:30 [fused_moe.py:962] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=1408,device_name=NVIDIA_H20.json
WARNING 04-07 06:05:30 [fused_moe.py:962] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=2048,device_name=NVIDIA_H20.json
INFO 04-07 06:05:31 [kv_cache_utils.py:566] GPU KV cache size: 135,264 tokens
INFO 04-07 06:05:31 [kv_cache_utils.py:569] Maximum concurrency for 32,768 tokens per request: 4.13x
INFO 04-07 06:05:32 [core.py:148] init engine (profile, create kv cache, warmup model) took 2.44 seconds
INFO 04-07 06:05:32 [factory.py:61] Creating v1 connector with name: SharedStorageConnector
INFO 04-07 06:05:32 [shared_storage_connector.py:91] kv_connector='SharedStorageConnector' kv_buffer_device='cuda' kv_buffer_size=1000000000.0 kv_role='kv_both' kv_rank=None kv_parallel_size=1 kv_ip='127.0.0.1' kv_port=14579 kv_connector_extra_config={'shared_storage_path': 'local_storage'}
INFO 04-07 06:05:32 [shared_storage_connector.py:92] Shared storage path is local_storage
Processed prompts: 100%|████████████████████████████| 4/4 [00:00<00:00, 4.98it/s, est. speed input: 3763.96 toks/s, output: 4.98 toks/s]
Prompt: 'Hi Hi Hi Hi Hi Hi ............ Hi Hi Hi Hi Hi Hi Hi Hi Hi Hi Hi Hi Hello, my name is', Generated text: ' ['
Prompt: 'Hi Hi Hi Hi Hi Hi ........... Hi Hi Hi Hi Hi Hi Hi Hi Hi Hi Hi Hi The capital of France is', Generated text: ' Paris'
Prompt: 'Hey Hey Hey Hey ............. Hey Hey Hey Hey Your name is', Generated text: ' not'
Prompt: 'Hey Hey Hey Hey ............. Hey Hey Hey The capital of China is', Generated text: ' Beijing'
Saved 4 prompts to output.txt
[rank0]:[W407 06:05:33.694118251 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
INFO 04-07 06:05:36 [__init__.py:239] Automatically detected platform cuda.
Loaded 4 prompts from output.txt
INFO 04-07 06:05:38 [config.py:208] Replacing legacy 'type' key with 'rope_type'
INFO 04-07 06:05:43 [config.py:591] This model supports multiple tasks: {'classify', 'score', 'embed', 'generate', 'reward'}. Defaulting to 'generate'.
INFO 04-07 06:05:43 [config.py:1712] Chunked prefill is enabled with max_num_batched_tokens=8192.
WARNING 04-07 06:05:43 [cuda.py:95] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 04-07 06:05:43 [core.py:54] Initializing a V1 LLM engine (v0.7.4.dev713+gb4961126) with config: model='/disc/data1/deepseek/DeepSeek-V2-Lite-Chat/', speculative_config=None, tokenizer='/disc/data1/deepseek/DeepSeek-V2-Lite-Chat/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/disc/data1/deepseek/DeepSeek-V2-Lite-Chat/, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}
WARNING 04-07 06:05:44 [utils.py:2321] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7fb71895cc50>
INFO 04-07 06:05:45 [parallel_state.py:983] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 04-07 06:05:45 [factory.py:61] Creating v1 connector with name: SharedStorageConnector
INFO 04-07 06:05:45 [shared_storage_connector.py:91] kv_connector='SharedStorageConnector' kv_buffer_device='cuda' kv_buffer_size=1000000000.0 kv_role='kv_both' kv_rank=None kv_parallel_size=1 kv_ip='127.0.0.1' kv_port=14579 kv_connector_extra_config={'shared_storage_path': 'local_storage'}
INFO 04-07 06:05:45 [shared_storage_connector.py:92] Shared storage path is local_storage
INFO 04-07 06:05:45 [cuda.py:220] Using Flash Attention backend on V1 engine.
INFO 04-07 06:05:45 [gpu_model_runner.py:1179] Starting to load model /disc/data1/deepseek/DeepSeek-V2-Lite-Chat/...
INFO 04-07 06:05:45 [topk_topp_sampler.py:53] Using FlashInfer for top-p & top-k sampling.
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:00<00:02, 1.25it/s]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:01<00:01, 1.18it/s]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:02<00:00, 1.20it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00, 1.38it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00, 1.31it/s]
INFO 04-07 06:05:48 [loader.py:447] Loading weights took 3.10 seconds
INFO 04-07 06:05:48 [gpu_model_runner.py:1191] Model loading took 29.3011 GB and 3.221119 seconds
WARNING 04-07 06:05:49 [fused_moe.py:962] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=1408,device_name=NVIDIA_H20.json
WARNING 04-07 06:05:49 [fused_moe.py:962] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=2048,device_name=NVIDIA_H20.json
INFO 04-07 06:05:50 [kv_cache_utils.py:566] GPU KV cache size: 135,264 tokens
INFO 04-07 06:05:50 [kv_cache_utils.py:569] Maximum concurrency for 32,768 tokens per request: 4.13x
INFO 04-07 06:05:51 [core.py:148] init engine (profile, create kv cache, warmup model) took 2.58 seconds
INFO 04-07 06:05:51 [factory.py:61] Creating v1 connector with name: SharedStorageConnector
INFO 04-07 06:05:51 [shared_storage_connector.py:91] kv_connector='SharedStorageConnector' kv_buffer_device='cuda' kv_buffer_size=1000000000.0 kv_role='kv_both' kv_rank=None kv_parallel_size=1 kv_ip='127.0.0.1' kv_port=14579 kv_connector_extra_config={'shared_storage_path': 'local_storage'}
INFO 04-07 06:05:51 [shared_storage_connector.py:92] Shared storage path is local_storage
Processed prompts: 0%| | 0/4 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]INFO 04-07 06:05:51 [shared_storage_connector.py:251] Hit the cache! Allocate new blocks!
INFO 04-07 06:05:51 [shared_storage_connector.py:251] Hit the cache! Allocate new blocks!
INFO 04-07 06:05:51 [shared_storage_connector.py:251] Hit the cache! Allocate new blocks!
INFO 04-07 06:05:51 [shared_storage_connector.py:251] Hit the cache! Allocate new blocks!
INFO 04-07 06:05:51 [shared_storage_connector.py:152] Inject KV cache of 992 tokens to the paged memory
INFO 04-07 06:05:51 [shared_storage_connector.py:152] Inject KV cache of 496 tokens to the paged memory
INFO 04-07 06:05:51 [shared_storage_connector.py:152] Inject KV cache of 496 tokens to the paged memory
Processed prompts: 100%|███████████████████████████| 4/4 [00:00<00:00, 9.71it/s, est. speed input: 7347.90 toks/s, output: 84.99 toks/s]
Prompt: 'Hi Hi Hi Hi Hi........Hi Hi Hi Hi Hi Hi Hi Hi Hi Hi Hi Hi Hi Hi Hi Hi Hi Hi Hi Hello, my name is [', Generated text: 'Your Name] and I am a [Your Job'
Prompt: 'Hi Hi Hi Hi.......... Hi Hi Hi Hi Hi Hi Hi The capital of France is Paris', Generated text: '. The city of Paris is located along the Seine'
Prompt: 'Hey Hey Hey ........ Hey Hey Your name is not', Generated text: ' on the list.'
Prompt: 'Hey Hey Hey .........Hey The capital of China is Beijing', Generated text: '.\n\nThe capital of China is Beijing.'
[rank0]:[W407 06:05:52.480797748 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
LGTM.
@ApostaC Do you think it would be feasible to include a simple online server example in this PR to demonstrate how the orchestrator would interact with the KVConnector?
> @ApostaC Do you think it would be feasible to include a simple online server example in this PR to demonstrate how the orchestrator would interact with the KVConnector?
@VertexC Yeah, it would be pretty helpful. I think it could probably be in a separate PR, as this one is already pretty large.
Hello! Since it wasn't too much work, I added MLA support in this PR. Hope this helps; the changes are pretty light (mainly doing `if MLA: [correct shape indices]`), so I hope it doesn't slow down this PR.
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @ApostaC.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
@robertgshaw2-redhat Thanks for the update today. Just to sync with you: I've tested the latest state (80 commits), and there is an assertion failure in block_pool.py.
> @robertgshaw2-redhat Thanks for the update today. Just to sync with you: I've tested the latest state (80 commits), and there is an assertion failure in block_pool.py.
Thanks, I am aware
@robertgshaw2-redhat I guess the reason is here
> @robertgshaw2-redhat I guess the reason is here
This should be resolved now
There is one small edge case left:
Specifically, the case where:
- a request is preempted
- the request is rescheduled and gets a remote cache hit during recomputation
- the connector calls from_request(request: NewRequestData) when making the metadata; however, in this case we do not have NewRequestData, we have CachedRequestData. This is a type mismatch.
I have a test for this case; will fix in the AM.
@WoosukKwon - otherwise this should be good
@robertgshaw2-redhat As discussed offline, I'm ok with merging this PR. However, I'd like to defer any other followup PRs (such as #16625) until we land the hybrid memory allocator, since there will be substantial changes in the KV cache manager.
May I ask when the branch will be merged?
> May I ask when the branch will be merged?
Just getting the tests green.
But is there a demo to run? Can I run like this?
VLLM_USE_V1=1 python3 -m vllm.entrypoints.openai.api_server --model xxx --port 8500 --max-model-len 8192 --gpu-memory-utilization 0.9 --kv-transfer-config '{"kv_connector":"LMCacheConnector","kv_role":"kv_consumer"}' --trust-remote-code --tensor-parallel-size 8
> But is there a demo to run? Can I run like this?
@Huixxi There is an example in #16625
> But is there a demo to run? Can I run like this?
> @Huixxi There is an example in #16625
Thanks! Which branch of the source code should I use? Is https://github.com/ApostaC/vllm/tree/local-dev/lmcache-v1-connector-pr the right one? And does it support xPyD now? Multiple nodes? Also, which version of LMCache should I install, and how?
How do I run this with vLLM serve?
python3 -m vllm.entrypoints.openai.api_server --model /extended/downloaded/Meta-Llama-3.1-70B-Instruct-quantized.w8a8/ --kv-transfer-config '{"kv_connector":"LMCacheConnector", "kv_role":"kv_both"}' --max-model-len 4096 --gpu-memory-utilization 0.9
It gives this error:
llm.v1.worker.gpu_worker.Worker object at 0xf17daec353d0>
INFO 04-28 18:01:26 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
ERROR 04-28 18:01:26 [core.py:396] EngineCore failed to start.
ERROR 04-28 18:01:26 [core.py:396] Traceback (most recent call last):
ERROR 04-28 18:01:26 [core.py:396] File "/workspace/vllm/vllm/v1/engine/core.py", line 387, in run_engine_core
ERROR 04-28 18:01:26 [core.py:396] engine_core = EngineCoreProc(*args, **kwargs)
ERROR 04-28 18:01:26 [core.py:396] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-28 18:01:26 [core.py:396] File "/workspace/vllm/vllm/v1/engine/core.py", line 329, in __init__
ERROR 04-28 18:01:26 [core.py:396] super().__init__(vllm_config, executor_class, log_stats,
ERROR 04-28 18:01:26 [core.py:396] File "/workspace/vllm/vllm/v1/engine/core.py", line 64, in __init__
ERROR 04-28 18:01:26 [core.py:396] self.model_executor = executor_class(vllm_config)
ERROR 04-28 18:01:26 [core.py:396] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-28 18:01:26 [core.py:396] File "/workspace/vllm/vllm/executor/executor_base.py", line 52, in __init__
ERROR 04-28 18:01:26 [core.py:396] self._init_executor()
ERROR 04-28 18:01:26 [core.py:396] File "/workspace/vllm/vllm/executor/uniproc_executor.py", line 46, in _init_executor
ERROR 04-28 18:01:26 [core.py:396] self.collective_rpc("init_device")
ERROR 04-28 18:01:26 [core.py:396] File "/workspace/vllm/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
ERROR 04-28 18:01:26 [core.py:396] answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 04-28 18:01:26 [core.py:396] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-28 18:01:26 [core.py:396] File "/workspace/vllm/vllm/utils.py", line 2456, in run_method
ERROR 04-28 18:01:26 [core.py:396] return func(*args, **kwargs)
ERROR 04-28 18:01:26 [core.py:396] ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-28 18:01:26 [core.py:396] File "/workspace/vllm/vllm/worker/worker_base.py", line 604, in init_device
ERROR 04-28 18:01:26 [core.py:396] self.worker.init_device() # type: ignore
ERROR 04-28 18:01:26 [core.py:396] ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-28 18:01:26 [core.py:396] File "/workspace/vllm/vllm/v1/worker/gpu_worker.py", line 135, in init_device
ERROR 04-28 18:01:26 [core.py:396] init_worker_distributed_environment(self.vllm_config, self.rank,
ERROR 04-28 18:01:26 [core.py:396] File "/workspace/vllm/vllm/v1/worker/gpu_worker.py", line 329, in init_worker_distributed_environment
ERROR 04-28 18:01:26 [core.py:396] ensure_kv_transfer_initialized(vllm_config)
ERROR 04-28 18:01:26 [core.py:396] File "/workspace/vllm/vllm/distributed/kv_transfer/kv_transfer_state.py", line 63, in ensure_kv_transfer_initialized
ERROR 04-28 18:01:26 [core.py:396] _KV_CONNECTOR_AGENT = KVConnectorFactory.create_connector_v1(
ERROR 04-28 18:01:26 [core.py:396] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-28 18:01:26 [core.py:396] File "/workspace/vllm/vllm/distributed/kv_transfer/kv_connector/factory.py", line 63, in create_connector_v1
ERROR 04-28 18:01:26 [core.py:396] assert issubclass(connector_cls, KVConnectorBase_V1)
ERROR 04-28 18:01:26 [core.py:396] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-28 18:01:26 [core.py:396] AssertionError
Process EngineCore_0:
Traceback (most recent call last):
File "/root/.local/share/uv/python/cpython-3.12.9-linux-aarch64-gnu/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/root/.local/share/uv/python/cpython-3.12.9-linux-aarch64-gnu/lib/python3.12/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/workspace/vllm/vllm/v1/engine/core.py", line 400, in run_engine_core
raise e
File "/workspace/vllm/vllm/v1/engine/core.py", line 387, in run_engine_core
engine_core = EngineCoreProc(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/vllm/vllm/v1/engine/core.py", line 329, in __init__
super().__init__(vllm_config, executor_class, log_stats,
File "/workspace/vllm/vllm/v1/engine/core.py", line 64, in __init__
self.model_executor = executor_class(vllm_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/vllm/vllm/executor/executor_base.py", line 52, in __init__
self._init_executor()
File "/workspace/vllm/vllm/executor/uniproc_executor.py", line 46, in _init_executor
self.collective_rpc("init_device")
File "/workspace/vllm/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
answer = run_method(self.driver_worker, method, args, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/vllm/vllm/utils.py", line 2456, in run_method
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/workspace/vllm/vllm/worker/worker_base.py", line 604, in init_device
self.worker.init_device() # type: ignore
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/vllm/vllm/v1/worker/gpu_worker.py", line 135, in init_device
init_worker_distributed_environment(self.vllm_config, self.rank,
File "/workspace/vllm/vllm/v1/worker/gpu_worker.py", line 329, in init_worker_distributed_environment
ensure_kv_transfer_initialized(vllm_config)
File "/workspace/vllm/vllm/distributed/kv_transfer/kv_transfer_state.py", line 63, in ensure_kv_transfer_initialized
_KV_CONNECTOR_AGENT = KVConnectorFactory.create_connector_v1(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/vllm/vllm/distributed/kv_transfer/kv_connector/factory.py", line 63, in create_connector_v1
assert issubclass(connector_cls, KVConnectorBase_V1)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError
[rank0]:[W428 18:01:26.726953847 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/workspace/vllm/vllm/entrypoints/openai/api_server.py", line 1130, in <module>
uvloop.run(run_server(args))
File "/workspace/.venv/lib/python3.12/site-packages/uvloop/__init__.py", line 109, in run
return __asyncio.run(
^^^^^^^^^^^^^^
File "/root/.local/share/uv/python/cpython-3.12.9-linux-aarch64-gnu/lib/python3.12/asyncio/runners.py", line 195, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/root/.local/share/uv/python/cpython-3.12.9-linux-aarch64-gnu/lib/python3.12/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
File "/workspace/.venv/lib/python3.12/site-packages/uvloop/__init__.py", line 61, in wrapper
return await main
^^^^^^^^^^
File "/workspace/vllm/vllm/entrypoints/openai/api_server.py", line 1078, in run_server
async with build_async_engine_client(args) as engine_client:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.local/share/uv/python/cpython-3.12.9-linux-aarch64-gnu/lib/python3.12/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/workspace/vllm/vllm/entrypoints/openai/api_server.py", line 146, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.local/share/uv/python/cpython-3.12.9-linux-aarch64-gnu/lib/python3.12/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/workspace/vllm/vllm/entrypoints/openai/api_server.py", line 178, in build_async_engine_client_from_engine_args
async_llm = AsyncLLM.from_vllm_config(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/vllm/vllm/v1/engine/async_llm.py", line 150, in from_vllm_config
return cls(
^^^^
File "/workspace/vllm/vllm/v1/engine/async_llm.py", line 118, in __init__
self.engine_core = core_client_class(
^^^^^^^^^^^^^^^^^^
File "/workspace/vllm/vllm/v1/engine/core_client.py", line 642, in __init__
super().__init__(
File "/workspace/vllm/vllm/v1/engine/core_client.py", line 398, in __init__
self._wait_for_engine_startup()
File "/workspace/vllm/vllm/v1/engine/core_client.py", line 430, in _wait_for_engine_startup
raise RuntimeError("Engine core initialization failed. "
RuntimeError: Engine core initialization failed. See root cause above.