[Dist KV] vllm pods crash when no kvcache pod is running on the same node.

Open gangmuk opened this issue 9 months ago • 1 comments

🚀 Feature Description and Motivation

vllm pods that do not have a kvcache pod running on the same node crash.

Every vllm pod should run on a node that also has a kvcache pod.

A temporary solution would be to spread kvcache pods across all nodes using affinity and anti-affinity, but that is not very reliable. A more elegant and reliable solution is needed.
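For reference, the affinity/anti-affinity workaround could look roughly like the sketch below, written as Python dicts that mirror the Kubernetes `affinity` field. The `app: kvcache` label and the helper names are assumptions for illustration, not names taken from this project.

```python
import json

# Well-known node label; using it as topologyKey makes the rule per-node.
HOSTNAME_KEY = "kubernetes.io/hostname"


def same_node_term(labels):
    """Build one pod affinity term that matches pods with `labels` on the same node."""
    return {
        "labelSelector": {"matchLabels": dict(labels)},
        "topologyKey": HOSTNAME_KEY,
    }


def cache_pod_affinity(cache_labels):
    """Anti-affinity for kvcache pods: no two cache pods land on the same node."""
    return {
        "podAntiAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": [
                same_node_term(cache_labels)
            ]
        }
    }


def engine_pod_affinity(cache_labels):
    """Affinity for vllm pods: only schedule onto nodes that run a cache pod."""
    return {
        "podAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": [
                same_node_term(cache_labels)
            ]
        }
    }


if __name__ == "__main__":
    labels = {"app": "kvcache"}  # hypothetical label, adjust to the real one
    print(json.dumps({
        "cache": cache_pod_affinity(labels),
        "engine": engine_pod_affinity(labels),
    }, indent=2))
```

Anti-affinity only keeps cache pods on distinct nodes; it does not guarantee that every node gets one, since that still depends on the replica count matching the node count. That gap is one reason this workaround is fragile.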

Use Case

distributed kv cache set up

Proposed Solution

No response

Mar 14 '25 19:03 gangmuk

vllm pods that do not have a kvcache pod running on the same node crash.

If a node with engine pods doesn't have a cache pod, the engine pods will crash. Affinity is one problem.
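A quick way to spot the failing placement is to group pods by node and look for nodes that host an engine pod but no cache pod. This is a minimal sketch over plain `(node, role)` pairs; the role names `engine` and `cache` are assumptions, and how the pairs are collected from the cluster is left out.

```python
from collections import defaultdict


def nodes_missing_cache(pods):
    """Return nodes that run at least one engine pod but no cache pod.

    `pods` is an iterable of (node_name, role) pairs, where role is
    "engine" or "cache" (hypothetical role names for this sketch).
    """
    roles_by_node = defaultdict(set)
    for node, role in pods:
        roles_by_node[node].add(role)
    return sorted(
        node
        for node, roles in roles_by_node.items()
        if "engine" in roles and "cache" not in roles
    )


if __name__ == "__main__":
    placement = [
        ("node-a", "engine"),
        ("node-a", "cache"),
        ("node-b", "engine"),  # no cache pod here, so this engine pod would crash
    ]
    print(nodes_missing_cache(placement))
```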

There's another issue I'm a little concerned about: right now each cache pod mounts a kv-specific path instead of a kv-instance-level path. That means only one cache pod can be scheduled per node, which could be a problem as well.
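To illustrate the path concern: if the mount path is derived per cache instance rather than shared, two cache pods on one node no longer collide. The sketch below assumes the mount is a `hostPath` volume and uses a made-up base directory and instance names; none of these come from the actual deployment.

```python
import posixpath

# Hypothetical base directory for this sketch only.
BASE_PATH = "/var/run/kvcache"


def shared_path(base):
    """Same path for every cache pod on the node (the behaviour described above)."""
    return base


def instance_path(base, instance_name):
    """Instance-level path: each cache instance gets its own subdirectory."""
    return posixpath.join(base, instance_name)


def host_path_volume(volume_name, path):
    """Build a hostPath volume entry, assuming the cache mount is a hostPath."""
    return {
        "name": volume_name,
        "hostPath": {"path": path, "type": "DirectoryOrCreate"},
    }


if __name__ == "__main__":
    instances = ["kv-a", "kv-b"]  # hypothetical instance names
    shared = {shared_path(BASE_PATH) for _ in instances}
    scoped = {instance_path(BASE_PATH, name) for name in instances}
    print(len(shared), len(scoped))  # one shared path vs. one path per instance
    print(host_path_volume("kvcache-data", instance_path(BASE_PATH, "kv-a")))
```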

Let's gradually improve it to a more reliable state.

Mar 17 '25 04:03 Jeffwan