
[Usage]: How to deploy Mooncake in a multi-machine multi-GPU environment?

[Open] txh1873749380 opened this issue 1 month ago · 10 comments

Describe your usage question

How to deploy Mooncake in a multi-machine multi-GPU environment?

Before submitting a new issue...

  • [x] Make sure you already searched for relevant issues and read the documentation

txh1873749380, Nov 26 '25 07:11

https://github.com/kvcache-ai/Mooncake/blob/main/doc/en/mooncake-store.md#mooncake-store-python-api

stmatengss, Nov 26 '25 14:11

Another example deploys it with HiCache and SGLang: https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/mem_cache/storage/mooncake_store/README.md

stmatengss, Nov 26 '25 14:11

@stmatengss I haven't found any Ray deployment methods—could you recommend some?

txh1873749380, Nov 26 '25 14:11

1. Start the Mooncake master service (default RPC port 50051) and the HTTP metadata service (default port 8080):

```shell
# Mooncake_master.sh
mooncake_master \
    --rpc_port 50051 \
    --metrics_port 9003 \
    --enable_metric_reporting 1 \
    --enable_http_metadata_server 1 \
    --http_metadata_server_port 8080 \
    --http_metadata_server_host 0.0.0.0 \
    --root_fs_dir=/workspaces/zhangjh/mooncake_data \
    --cluster_id=mooncake_cluster
```

Expected output:

```
(MooncakeStore_venv) root@ubuntu:/workspaces/zhangjh/MooncakeStore# sh Mooncake_master.sh
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1017 15:44:10.975198 347901 master.cpp:362] Master service started on port 50051, max_threads=4, enable_metric_reporting=1, metrics_port=9003, default_kv_lease_ttl=5000, default_kv_soft_pin_ttl=1800000, allow_evict_soft_pinned_objects=1, eviction_ratio=0.05, eviction_high_watermark_ratio=0.95, enable_ha=0, etcd_endpoints=, client_ttl=10, rpc_thread_num=4, rpc_port=50051, rpc_address=0.0.0.0, rpc_conn_timeout_seconds=0, rpc_enable_tcp_no_delay=1, rpc protocol=tcp, cluster_id=mooncake_cluster, root_fs_dir=/mooncake, memory_allocator=offset, enable_http_metadata_server=1, http_metadata_server_port=8080, http_metadata_server_host=0.0.0.0
I1017 15:44:10.975406 347901 master.cpp:300] Starting C++ HTTP metadata server on 0.0.0.0:8080
I1017 15:44:10.976037 347901 http_metadata_server.cpp:108] HTTP metadata server started on 0.0.0.0:8080
I1017 15:44:10.976052 347901 master.cpp:309] C++ HTTP metadata server started successfully
I1017 15:44:12.011040 347901 rpc_service.cpp:172] HTTP metrics server started on port 9003
I1017 15:44:12.011780 347922 rpc_service.cpp:40] Master Metrics: Storage: 0 B / 0 B | Keys: 0 (soft-pinned: 0) | Requests (Success/Total): PutStart=0/0, PutEnd=0/0, PutRevoke=0/0, Get=0/0, Exist=0/0, Del=0/0, DelAll=0/0, | Batch Requests (Req=Success/PartialSuccess/Total, Item=Success/Total): PutStart:(Req=0/0/0, Item=0/0), PutEnd:(Req=0/0/0, Item=0/0), PutRevoke:(Req=0/0/0, Item=0/0), Get:(Req=0/0/0, Item=0/0), ExistKey:(Req=0/0/0, Item=0/0), | Eviction: Success/Attempts=0/0, keys=0, size=0 B
```
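For a multi-machine setup, it can help to confirm from each worker node that the master's RPC and metadata ports are actually reachable before starting any clients. A minimal sketch; `check_port` is a hypothetical helper (not part of Mooncake), and `MASTER_ADDR` is a placeholder you would point at the machine running `mooncake_master`:

```shell
# check_port is a hypothetical helper, not part of Mooncake: it succeeds
# iff host:port accepts a TCP connection, via bash's /dev/tcp pseudo-device.
check_port() {
  timeout 2 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Run from each worker node; MASTER_ADDR is the master machine's IP.
MASTER_ADDR="${MASTER_ADDR:-127.0.0.1}"
check_port "$MASTER_ADDR" 50051 && echo "master RPC (50051) reachable" || echo "master RPC (50051) NOT reachable"
check_port "$MASTER_ADDR" 8080  && echo "metadata server (8080) reachable" || echo "metadata server (8080) NOT reachable"
```

If either check fails from a worker node, the clients on that node will not be able to register with the shared master, regardless of the config.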

2. Config (mooncake-config.yaml):

```yaml
chunk_size: 256
remote_url: "mooncakestore://localhost:50051/"
remote_serde: "naive"
local_cpu: True
max_local_cpu_size: 20
extra_config:
  local_hostname: "192.168.255.80"
  metadata_server: "http://localhost:8080/metadata"
  protocol: "rdma"
  device_name: "mlx5_0"
  master_server_address: "localhost:50051"
  global_segment_size: 12884901888
  local_buffer_size: 2147483648
  eviction_high_watermark_ratio: 0.9
  eviction_ratio: 0.1
  transfer_timeout: 10
```
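For the multi-machine case this issue asks about, a hedged sketch of what would change in this config on each worker node: every address that currently says `localhost` has to point at the master machine instead, while `local_hostname` stays per-node. Here `192.168.10.1` is a placeholder for the master node's IP, not a value from the original setup:

```yaml
# Sketch only: per-node overrides when the master runs on another machine.
# 192.168.10.1 is a placeholder for the master node's IP.
remote_url: "mooncakestore://192.168.10.1:50051/"
extra_config:
  local_hostname: "<this node's own IP>"        # different on every node
  metadata_server: "http://192.168.10.1:8080/metadata"
  master_server_address: "192.168.10.1:50051"
```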

3. Start vLLM + Mooncake (MooncakeServe.sh):

```shell
# MooncakeServe.sh
timestamp=$(date +"%Y%m%d_%H%M%S")

# Note: the env assignments must prefix the vllm command (or be exported);
# as standalone lines they would never reach the vllm process.
PYTHONHASHSEED="1" \
LMCACHE_USE_EXPERIMENTAL=True \
LMCACHE_CONFIG_FILE="mooncake-config.yaml" \
VLLM_USE_MODELSCOPE=True \
CUDA_VISIBLE_DEVICES=2,3 \
vllm serve /workspaces/modelscope-yrcache/modelscope/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
    --max-model-len 24576 \
    --gpu-memory-utilization 0.9 \
    --port 8202 \
    --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}' \
    --served-model-name Qwen-32B \
    --tensor-parallel-size 2 2>&1 | tee /workspaces/zhangjh/mooncake_test/Qwen-32B/logs/Qwen-32B-MooncakeServe_$timestamp.log
```

This is how I started it—on a single machine with a single GPU.

txh1873749380, Nov 26 '25 14:11

@stmatengss

txh1873749380, Nov 26 '25 14:11

Hello, I think you can read "vLLM V1 Disaggregated Serving with Mooncake Store and LMCache" to deploy Mooncake for each P/D instance. @txh1873749380

Keithwwa, Nov 27 '25 06:11

> (quoting the previous comment's single-machine setup: master startup, mooncake-config.yaml, and the vLLM launch script)

You can run multiple vLLM instances simultaneously, each colocated with a Mooncake client. These Mooncake clients can share the same master.
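To follow this on several machines, each node needs its own copy of mooncake-config.yaml: the master and metadata addresses stay fixed across the cluster, while `local_hostname` is set to that node's own IP. A minimal sketch that generates such a file; the helper script itself is illustrative (not part of Mooncake or LMCache), the keys mirror the config posted earlier in the thread, and `MASTER_ADDR`/`NODE_IP` defaults are placeholder IPs:

```shell
#!/bin/sh
# Illustrative only: write a per-node LMCache/Mooncake config pointing at a
# shared master. MASTER_ADDR / NODE_IP are placeholders; override via env.
MASTER_ADDR="${MASTER_ADDR:-192.168.10.1}"   # master node's IP
NODE_IP="${NODE_IP:-192.168.10.2}"           # this node's own IP

cat > mooncake-config.yaml <<EOF
chunk_size: 256
remote_url: "mooncakestore://${MASTER_ADDR}:50051/"
remote_serde: "naive"
local_cpu: True
max_local_cpu_size: 20
extra_config:
  local_hostname: "${NODE_IP}"
  metadata_server: "http://${MASTER_ADDR}:8080/metadata"
  protocol: "rdma"
  device_name: "mlx5_0"
  master_server_address: "${MASTER_ADDR}:50051"
  global_segment_size: 12884901888
  local_buffer_size: 2147483648
  eviction_high_watermark_ratio: 0.9
  eviction_ratio: 0.1
  transfer_timeout: 10
EOF
echo "wrote mooncake-config.yaml: node ${NODE_IP}, master ${MASTER_ADDR}"
```

Each vLLM instance would then be launched with its node's config via `LMCACHE_CONFIG_FILE`, exactly as in the single-machine script above.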

stmatengss, Nov 27 '25 08:11

@stmatengss Do you mean starting two Mooncake services with identical configurations except for root_fs_dir and cluster_id, or starting just one Mooncake service? My understanding is that both machines should hold data, but when starting with Ray, only one Mooncake node ends up with data on its disk.

(two screenshots attached)

txh1873749380, Nov 27 '25 09:11

@stmatengss

txh1873749380, Nov 30 '25 10:11

@Keithwwa

txh1873749380, Nov 30 '25 10:11