[Usage]: How to deploy Mooncake in a multi-machine multi-GPU environment?
Describe your usage question
How to deploy Mooncake in a multi-machine multi-GPU environment?
Before submitting a new issue...
- [x] Make sure you already searched for relevant issues and read the documentation
https://github.com/kvcache-ai/Mooncake/blob/main/doc/en/mooncake-store.md#mooncake-store-python-api
Another example deployment uses HiCache with SGLang: https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/mem_cache/storage/mooncake_store/README.md
@stmatengss I haven't found any Ray deployment methods; could you recommend some?
1. # Start the Mooncake master service (default RPC port 50051) and the HTTP metadata service (default port 8080)
(MooncakeStore_venv) root@ubuntu:/workspaces/zhangjh/MooncakeStore# cat Mooncake_master.sh
mooncake_master \
    --rpc_port 50051 \
    --metrics_port 9003 \
    --enable_metric_reporting 1 \
    --enable_http_metadata_server 1 \
    --http_metadata_server_port 8080 \
    --http_metadata_server_host 0.0.0.0 \
    --root_fs_dir=/workspaces/zhangjh/mooncake_data \
    --cluster_id=mooncake_cluster
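Before pointing clients on other machines at this master, it can save debugging time to confirm that the RPC port (50051) and the HTTP metadata port (8080) are actually reachable from each worker node. A minimal sketch using bash's built-in /dev/tcp; the address 192.168.255.80 is taken from the config in this thread and is an assumption, so adjust it for your cluster:

```shell
#!/usr/bin/env bash
# Return 0 if something is listening on host:port, non-zero otherwise.
# Uses bash's built-in /dev/tcp, so no extra tools are needed.
port_open() {
  local host="$1" port="$2"
  timeout 2 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

MASTER_IP="192.168.255.80"   # assumed master address; change for your cluster
for port in 50051 8080; do
  if port_open "$MASTER_IP" "$port"; then
    echo "port $port reachable on $MASTER_IP"
  else
    echo "port $port NOT reachable on $MASTER_IP"
  fi
done
```

Run this from each worker node; if 8080 is unreachable, clients are likely to fail at metadata registration before any transfer starts.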
Expected return
(MooncakeStore_venv) root@ubuntu:/workspaces/zhangjh/MooncakeStore# sh Mooncake_master.sh
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1017 15:44:10.975198 347901 master.cpp:362] Master service started on port 50051, max_threads=4, enable_metric_reporting=1, metrics_port=9003, default_kv_lease_ttl=5000, default_kv_soft_pin_ttl=1800000, allow_evict_soft_pinned_objects=1, eviction_ratio=0.05, eviction_high_watermark_ratio=0.95, enable_ha=0, etcd_endpoints=, client_ttl=10, rpc_thread_num=4, rpc_port=50051, rpc_address=0.0.0.0, rpc_conn_timeout_seconds=0, rpc_enable_tcp_no_delay=1, rpc protocol=tcp, cluster_id=mooncake_cluster, root_fs_dir=/mooncake, memory_allocator=offset, enable_http_metadata_server=1, http_metadata_server_port=8080, http_metadata_server_host=0.0.0.0
I1017 15:44:10.975406 347901 master.cpp:300] Starting C++ HTTP metadata server on 0.0.0.0:8080
I1017 15:44:10.976037 347901 http_metadata_server.cpp:108] HTTP metadata server started on 0.0.0.0:8080
I1017 15:44:10.976052 347901 master.cpp:309] C++ HTTP metadata server started successfully
I1017 15:44:12.011040 347901 rpc_service.cpp:172] HTTP metrics server started on port 9003
I1017 15:44:12.011780 347922 rpc_service.cpp:40] Master Metrics: Storage: 0 B / 0 B | Keys: 0 (soft-pinned: 0) | Requests (Success/Total): PutStart=0/0, PutEnd=0/0, PutRevoke=0/0, Get=0/0, Exist=0/0, Del=0/0, DelAll=0/0, | Batch Requests (Req=Success/PartialSuccess/Total, Item=Success/Total): PutStart:(Req=0/0/0, Item=0/0), PutEnd:(Req=0/0/0, Item=0/0), PutRevoke:(Req=0/0/0, Item=0/0), Get:(Req=0/0/0, Item=0/0), ExistKey:(Req=0/0/0, Item=0/0), | Eviction: Success/Attempts=0/0, keys=0, size=0 B
2. # Config: mooncake-config.yaml
chunk_size: 256
remote_url: "mooncakestore://localhost:50051/"
remote_serde: "naive"
local_cpu: True
max_local_cpu_size: 20
extra_config:
  local_hostname: "192.168.255.80"
  metadata_server: "http://localhost:8080/metadata"
  protocol: "rdma"
  device_name: "mlx5_0"
  master_server_address: "localhost:50051"
  global_segment_size: 12884901888
  local_buffer_size: 2147483648
  eviction_high_watermark_ratio: 0.9
  eviction_ratio: 0.1
  transfer_timeout: 10
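For the multi-machine case, a common pattern (an assumption here, not something verified on your cluster) is to keep a single master and give every node its own copy of this config: local_hostname holds that node's own IP, while remote_url, metadata_server, and master_server_address all point at the machine running mooncake_master instead of localhost. A sketch for a second node, assuming the master stays on 192.168.255.80 and the new node is 192.168.255.81 (a hypothetical address):

```yaml
# mooncake-config.yaml on a second node (addresses are illustrative)
chunk_size: 256
remote_url: "mooncakestore://192.168.255.80:50051/"   # master node, no longer localhost
remote_serde: "naive"
local_cpu: True
max_local_cpu_size: 20
extra_config:
  local_hostname: "192.168.255.81"                    # this node's own IP (hypothetical)
  metadata_server: "http://192.168.255.80:8080/metadata"
  protocol: "rdma"
  device_name: "mlx5_0"                               # this node's RDMA device
  master_server_address: "192.168.255.80:50051"
  global_segment_size: 12884901888
  local_buffer_size: 2147483648
```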
3. # Start vLLM + Mooncake
(MooncakeStore_venv) root@ubuntu:/workspaces/zhangjh/MooncakeStore# cat MooncakeServe.sh
timestamp=$(date +"%Y%m%d_%H%M%S")
PYTHONHASHSEED="1" \
LMCACHE_USE_EXPERIMENTAL=True \
LMCACHE_CONFIG_FILE="mooncake-config.yaml" \
VLLM_USE_MODELSCOPE=True \
CUDA_VISIBLE_DEVICES=2,3 \
vllm serve /workspaces/modelscope-yrcache/modelscope/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
    --max-model-len 24576 \
    --gpu-memory-utilization 0.9 \
    --port 8202 \
    --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}' \
    --served-model-name Qwen-32B \
    --tensor-parallel-size 2 \
    2>&1 | tee /workspaces/zhangjh/mooncake_test/Qwen-32B/logs/Qwen-32B-MooncakeServe_$timestamp.log
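To add another instance, whether on a second machine or another GPU pair on the same one, only a few knobs change relative to MooncakeServe.sh: the visible GPUs, the serving port, and which per-node config file LMCache loads. A hedged sketch that only assembles and prints the command as a dry run (remove the echo to actually launch; the GPU list, port, and config name are placeholders, not values from your setup):

```shell
#!/usr/bin/env bash
# Per-instance knobs; everything else mirrors MooncakeServe.sh above.
GPUS="${GPUS:-0,1}"                       # GPUs for this instance (placeholder)
PORT="${PORT:-8203}"                      # must be unique per instance
CONFIG="${CONFIG:-mooncake-config.yaml}"  # per-node LMCache config
MODEL="/workspaces/modelscope-yrcache/modelscope/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"

# Dry run: echo prints the command instead of executing it.
echo CUDA_VISIBLE_DEVICES="$GPUS" \
  LMCACHE_USE_EXPERIMENTAL=True \
  LMCACHE_CONFIG_FILE="$CONFIG" \
  vllm serve "$MODEL" \
  --port "$PORT" \
  --tensor-parallel-size 2 \
  --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'
```

The per-node config is what lets each instance register its own local_hostname with the shared master while still finding it at the same address.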
This is how I started it, on a single machine with a single GPU.
@stmatengss
Hello, I think you can read the "vLLM V1 Disaggregated Serving with Mooncake Store and LMCache" guide to deploy Mooncake for each P/D instance. @txh1873749380
You can run multiple vLLM instances simultaneously, each colocated with a Mooncake client. These Mooncake clients can share the same master.
@stmatengss Do you mean starting two Mooncake services with identical configurations except for root_fs_dir and cluster_id, or starting just one Mooncake service? My understanding is that both machines should hold data, but when starting with Ray, only one Mooncake node ends up with data on its disk.
@stmatengss
@Keithwwa