[Usage]: When using MooncakeDistributedStore in Python, after the writing client is closed, other clients cannot query the written value

Open hhwode opened this issue 6 months ago • 10 comments

Describe your usage question

On node 1:

# node 1
from mooncake.store import MooncakeDistributedStore
store = MooncakeDistributedStore()
store.setup("10.248.92.175", "http://10.248.253.230:8080/metadata", 3200 * 1024 * 1024, 512 * 1024 * 1024, "rdma", "mlx5_2", "10.248.92.175:54321")
key = "test_teardown_key"
test_data = b"Hello, World!"
store.put(key, test_data)
store.is_exist(key)  # ret=1, exist

On node 2:

from mooncake.store import MooncakeDistributedStore
store = MooncakeDistributedStore()
store.setup("10.248.253.230", "http://10.248.253.230:8080/metadata", 3200 * 1024 * 1024, 512 * 1024 * 1024, "rdma", "mlx5_2", "10.248.92.175:54321")
key = "test_teardown_key"
store.is_exist(key)  # ret=1: the key exists while node 1 is running
store.is_exist(key)  # ret=0: the key no longer exists after node 1 is closed

According to the documentation at https://github.com/kvcache-ai/Mooncake/blob/main/doc/en/mooncake-store-preview.md, the put operation should copy replicas to other clients. However, the KV pair seems to exist only on the writing client, and it is deleted once that client is closed.

master service logs:

I0709 17:14:49.469707  4236 rpc_service.h:136] Master Metrics: Storage: 13.00 B / 3.12 GB (0.0%) | Keys: 1 | Requests (Success/Total): Put=2/2, Get=0/0, Exist=1/2, Del=0/0, DelAll=0/0,  | Eviction: Success/Attempts=0/0, keys=0, size=0.00 B
I0709 17:14:59.469836  4236 rpc_service.h:136] Master Metrics: Storage: 13.00 B / 6.25 GB (0.0%) | Keys: 1 | Requests (Success/Total): Put=2/2, Get=0/0, Exist=1/2, Del=0/0, DelAll=0/0,  | Eviction: Success/Attempts=0/0, keys=0, size=0.00 B
I0709 17:15:09.469965  4236 rpc_service.h:136] Master Metrics: Storage: 13.00 B / 6.25 GB (0.0%) | Keys: 1 | Requests (Success/Total): Put=2/2, Get=0/0, Exist=2/3, Del=0/0, DelAll=0/0,  | Eviction: Success/Attempts=0/0, keys=0, size=0.00 B
I0709 17:15:19.470116  4236 rpc_service.h:136] Master Metrics: Storage: 0.00 B / 3.12 GB (0.0%) | Keys: 0 | Requests (Success/Total): Put=2/2, Get=0/0, Exist=2/3, Del=0/0, DelAll=0/0,  | Eviction: Success/Attempts=0/0, keys=0, size=0.00 B
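
The four lines above show the capacity going from 3.12 GB to 6.25 GB and back to 3.12 GB while the key count drops from 1 to 0, which is consistent with one client's segment being removed together with its only replica. A minimal sketch for pulling those fields out of such lines follows; the pattern is inferred from these four lines only, and `parse_master_metrics` is a hypothetical helper, not part of Mooncake.

```python
import re

# Field layout inferred from the "Master Metrics" lines in this issue only.
METRICS_RE = re.compile(
    r"Storage: (?P<used>[\d.]+ \w+) / (?P<capacity>[\d.]+ \w+) .*?\| Keys: (?P<keys>\d+)"
)


def parse_master_metrics(line):
    """Return (used, capacity, key_count) from one log line, or None if it does not match."""
    match = METRICS_RE.search(line)
    if match is None:
        return None
    return match["used"], match["capacity"], int(match["keys"])
```

Running it over the master log makes it easy to spot the moment the key count falls to zero.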

Before submitting a new issue...

  • [ ] Make sure you already searched for relevant issues and read the documentation

hhwode avatar Jul 09 '25 09:07 hhwode

Right now in the Python bindings, we default the replica number to 1 and try to allocate memory from the writing client itself first. I think this may clear up the confusion you had.

I'll submit a PR soon to add support for ReplicaConfig in the Python bindings.
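
To make the behaviour concrete, here is a toy model of it in plain Python. It is not Mooncake code: `ToyMaster`, its methods, and the "writer first, then other segments" placement order are simplifying assumptions based only on the description above.

```python
class ToyMaster:
    """Toy stand-in for the master: tracks which client segment holds each replica."""

    def __init__(self):
        self.segments = set()  # currently mounted client segments
        self.replicas = {}     # key -> set of segments holding a copy

    def mount(self, segment):
        self.segments.add(segment)

    def put(self, writer, key, replica_num=1):
        # Assumption: prefer the writer's own segment, then fill from the others.
        order = [writer] + sorted(self.segments - {writer})
        self.replicas[key] = set(order[:replica_num])

    def unmount(self, segment):
        # When a client closes, its segment and every replica on it go away.
        self.segments.discard(segment)
        for key in list(self.replicas):
            self.replicas[key].discard(segment)
            if not self.replicas[key]:
                del self.replicas[key]

    def is_exist(self, key):
        return 1 if key in self.replicas else 0


def demo(replica_num):
    """Return is_exist() before and after the writing node closes."""
    master = ToyMaster()
    master.mount("node1")
    master.mount("node2")
    master.put("node1", "test_teardown_key", replica_num)
    before = master.is_exist("test_teardown_key")
    master.unmount("node1")
    return before, master.is_exist("test_teardown_key")
```

In this model `demo(1)` loses the key once node 1 closes, while `demo(2)` keeps it because a second copy lives on node 2.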

xiaguan avatar Jul 09 '25 09:07 xiaguan

Is there any other way to configure the number of replicas on the Python side? Also, the documentation states that KV data can be offloaded to disk through MOONCAKE-STORAGE-ROOT_DIR. Does the Python client not support this?

hhwode avatar Jul 09 '25 09:07 hhwode

The PR adding replica config to the Python bindings has been submitted (https://github.com/kvcache-ai/Mooncake/pull/608); it is just waiting on review and merge.

As for offloading to disk: yes, the Python client does support this. You just need to set the appropriate environment variable. That said, this feature is still experimental for now, and it's mainly intended to serve as a read cache for the client itself. @SgtPepperr, could you help answer this?

xiaguan avatar Jul 09 '25 10:07 xiaguan

The PR has been merged, but we haven't published a new release yet. If you need it right away, you can build from main. Python now supports setting ReplicateConfig directly when calling put() and related interfaces.
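
A rough sketch of what such a call might look like is below. The `replica_num` field and the three-argument `put()` are assumptions about the merged interface and should be checked against PR #608; `put_with_replicas` is a hypothetical helper that takes the config class as a parameter so it can be exercised without a running cluster.

```python
def put_with_replicas(store, key, value, make_config, replica_num=2):
    """Write `value` under `key`, asking for `replica_num` copies.

    `make_config` is expected to be the ReplicateConfig class mentioned above.
    Assumptions: the config exposes a `replica_num` field and put() accepts
    the config as its third argument.
    """
    config = make_config()
    config.replica_num = replica_num
    return store.put(key, value, config)


# Intended use with a build from main (not verified here):
#   from mooncake.store import MooncakeDistributedStore, ReplicateConfig
#   store = MooncakeDistributedStore()
#   store.setup(...)  # same arguments as in the original report
#   put_with_replicas(store, "test_teardown_key", b"Hello, World!", ReplicateConfig)
```

Requesting more than one replica presumably only helps when at least one other client has mounted memory to hold the extra copy.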

xiaguan avatar Jul 10 '25 06:07 xiaguan

Test script: mooncake-store/tests/stress_cluster_benchmark.py. When running the test script concurrently with role=prefill on node 1 and role=decode on node 2, the error 'Transfer failed for batch xxxxx task 0 with status 6' occurred, which is confusing. @xiaguan

LuyuZhang00 avatar Jul 10 '25 09:07 LuyuZhang00

Could you share more detailed logs? And please check your config.

Status 6 means the transfer failed.

xiaguan avatar Jul 10 '25 09:07 xiaguan

Example log:

E0710 09:48:23.776901 5615 store_py.cpp:1202] BatchGet failed for key 'key1199': TRANSFER_FAIL

The script defines parser.add_argument("--local-hostname", type=str, default="localhost", help="Local hostname"). On node 1 I set local-hostname to node 1's name, and on node 2 I set it to node 2's name. Which value should local-hostname be set to?

LuyuZhang00 avatar Jul 10 '25 10:07 LuyuZhang00

Is it possible that the prefill node exited during the data transfer? Since we're using the client's own memory for storage, that could explain it.
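
One quick way to check this from the decode side is to ask the master whether the key is still registered after a failed read. The sketch below uses only `is_exist()`, with the 1/0 return values shown in the original report; `diagnose_missing_key` is a hypothetical helper, and the hints it returns are guesses rather than definitive diagnoses.

```python
def diagnose_missing_key(store, key):
    """Return a hint about why reading `key` failed, based on is_exist()."""
    if store.is_exist(key) == 1:
        # The master still lists the key, so the owning client is probably alive.
        return "key still registered; inspect transfer logs and network config"
    # The key is gone, which matches the writer closing and taking its memory with it.
    return "key no longer registered; the writing client may have exited"
```

If the key is gone, keeping the prefill process alive until the decode side has finished reading is the first thing to try.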

xiaguan avatar Jul 10 '25 10:07 xiaguan

Thanks, I will try it.

hhwode avatar Jul 11 '25 03:07 hhwode

When any client is shut down, we should provide an option to migrate data.

stmatengss avatar Nov 25 '25 05:11 stmatengss