[KVCache (Vineyard)] Does the PodAffinity on the model name in the KVCache CRD need proper reconciliation?
🐛 Describe the bug
Originally, the kvcache pod and the vllm-engine pod were scheduled onto the same node by the PodAffinity in the KVCache CRD.
However, after restarting llama-3-8b-instruct, the kvcache pod and the engine pod end up on different nodes: the new engine pod is scheduled onto a new node, while the kvcache pod stays on its original node.
I'm not sure which approach is best to keep the kvcache pod and the engine pod colocated.
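My understanding of why the pods drift apart, with a minimal sketch: given the `kvcache.orchestration.aibrix.ai/pod-affinity-workload` annotation in the KVCache CR below, the controller presumably injects a required podAffinity term into the kvcache pod keyed on the engine's `model.aibrix.ai/name` label, roughly like the following (the exact generated fields are my assumption, not copied from the controller source). Since `requiredDuringSchedulingIgnoredDuringExecution` is only evaluated when the kvcache pod itself is scheduled, the already-running kvcache pod is never moved when the engine pod lands on another node after a restart.

```yaml
# Assumed shape of the affinity the KVCache controller injects into the kvcache
# pod, based on the pod-affinity-workload annotation; not taken from the source.
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: model.aibrix.ai/name
          operator: In
          values:
          - llama-3-8b-instruct
      # Affinity is enforced only at scheduling time, so it cannot pull the
      # kvcache pod back next to an engine pod that later moves to another node.
      topologyKey: kubernetes.io/hostname
```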
Steps to Reproduce
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-3-8b-instruct
  labels:
    model.aibrix.ai/name: llama-3-8b-instruct
    model.aibrix.ai/port: "8000"
spec:
  replicas: 1
  strategy:
    rollingUpdate:
      maxSurge: 100%
      maxUnavailable: 100%
    type: RollingUpdate
  selector:
    matchLabels:
      model.aibrix.ai/name: llama-3-8b-instruct
  template:
    metadata:
      labels:
        model.aibrix.ai/name: llama-3-8b-instruct
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: machine.cluster.vke.volcengine.com/gpu-name
                operator: In
                values:
                - NVIDIA-L20
      containers:
      - command:
        - python3
        - -m
        - vllm.entrypoints.openai.api_server
        - --port
        - "8000"
        - --uvicorn-log-level
        - warning
        - --model
        - /models/llama-3.1-8b-instruct/
        - --served-model-name
        - llama-3-8b-instruct
        - --trust-remote-code
        - --enable-chunked-prefill
        - "false"
        - --max-model-len
        - "100000"
        - --dtype
        - bfloat16
        - --disable-log-requests
        - --swap-space
        - "0"
        # - --enable-prefix-caching
        env:
        - name: VLLM_USE_VINEYARD_CACHE
          value: "0"
        - name: VINEYARD_CACHE_CPU_MEM_LIMIT_GB
          value: "72"
        - name: AIBRIX_LLM_KV_CACHE
          value: "0"
        - name: AIBRIX_LLM_KV_CACHE_KV_CACHE_NS
          value: "aibrix"
        - name: AIBRIX_LLM_KV_CACHE_CHUNK_SIZE
          value: "16"
        - name: AIBRIX_LLM_KV_CACHE_SOCKET
          value: /var/run/vineyard.sock
        - name: AIBRIX_LLM_KV_CACHE_RPC_ENDPOINT
          value: "aibrix-kvcache-llama-3-8b-instruct-rpc:9600"
        - name: VINEYARD_CACHE_ENABLE_ASYNC_UPDATE
          value: "1"
        - name: "VINEYARD_CACHE_METRICS_ENABLED"
          value: "1"
        - name: FLAGS_metrics
          value: "1"
        - name: GLOG_logtostderr
          value: "1"
        image: aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/vllm-openai:v0.6.1-edb07092-20250118
        imagePullPolicy: IfNotPresent
        lifecycle:
          preStop:
            exec:
              command:
              - /bin/sh
              - -c
              - |
                while true; do
                  RUNNING=$(curl -s http://localhost:8000/metrics | grep 'vllm:num_requests_running' | grep -v '#' | awk '{print $2}')
                  WAITING=$(curl -s http://localhost:8000/metrics | grep 'vllm:num_requests_waiting' | grep -v '#' | awk '{print $2}')
                  if [ "$RUNNING" = "0.0" ] && [ "$WAITING" = "0.0" ]; then
                    echo "Terminating: No active or waiting requests, safe to terminate" >> /proc/1/fd/1
                    exit 0
                  else
                    echo "Terminating: Running: $RUNNING, Waiting: $WAITING" >> /proc/1/fd/1
                    sleep 5
                  fi
                done
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: 8000
            scheme: HTTP
          initialDelaySeconds: 90
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        name: vllm-openai
        ports:
        - containerPort: 8000
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: 8000
            scheme: HTTP
          initialDelaySeconds: 90
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            cpu: 1
            nvidia.com/gpu: "1"
          requests:
            cpu: 1
            nvidia.com/gpu: "1"
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /models
          name: model-hostpath
        - mountPath: /dev/shm
          name: dshm
        - mountPath: /var/run
          name: kvcache-socket
      - command:
        - aibrix_runtime
        - --port
        - "8080"
        env:
        - name: INFERENCE_ENGINE
          value: vllm
        - name: INFERENCE_ENGINE_ENDPOINT
          value: http://localhost:8000
        image: aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/runtime:v0.2.0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /healthz
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 3
          periodSeconds: 2
          successThreshold: 1
          timeoutSeconds: 1
        name: aibrix-runtime
        ports:
        - containerPort: 8080
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /ready
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      initContainers:
      - command:
        - aibrix_download
        - --model-uri
        - tos://aibrix-artifact-testing/models/llama-3.1-8b-instruct/
        - --local-dir
        - /models/
        env:
        - name: DOWNLOADER_MODEL_NAME
          value: llama-3.1-8b-instruct
        - name: DOWNLOADER_NUM_THREADS
          value: "16"
        - name: DOWNLOADER_ALLOW_FILE_SUFFIX
          value: json, safetensors
        - name: TOS_ACCESS_KEY
          valueFrom:
            secretKeyRef:
              key: TOS_ACCESS_KEY
              name: tos-credential
        - name: TOS_SECRET_KEY
          valueFrom:
            secretKeyRef:
              key: TOS_SECRET_KEY
              name: tos-credential
        - name: TOS_ENDPOINT
          value: tos-cn-beijing.ivolces.com
        - name: TOS_REGION
          value: cn-beijing
        image: aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/runtime:v0.1.0
        imagePullPolicy: IfNotPresent
        name: init-model
        resources: {}
        volumeMounts:
        - mountPath: /models
          name: model-hostpath
      terminationGracePeriodSeconds: 60
      volumes:
      - name: model-hostpath
        hostPath:
          path: /root/models
          type: DirectoryOrCreate
      - name: dshm
        emptyDir:
          medium: Memory
          sizeLimit: "4Gi"
      - name: kvcache-socket
        hostPath:
          path: /var/run/vineyard-kubernetes/default/aibrix-kvcache-llama-3-8b-instruct
---
apiVersion: v1
kind: Service
metadata:
  labels:
    model.aibrix.ai/name: llama-3-8b-instruct
    prometheus-discovery: "true"
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
  name: llama-3-8b-instruct # Note: The Service name must match the label value `model.aibrix.ai/name` in the Deployment
  namespace: default
spec:
  ports:
  - name: serve
    port: 8000
    protocol: TCP
    targetPort: 8000
  - name: http
    port: 8080
    protocol: TCP
    targetPort: 8080
  selector:
    model.aibrix.ai/name: llama-3-8b-instruct
  type: ClusterIP
---
apiVersion: orchestration.aibrix.ai/v1alpha1
kind: KVCache
metadata:
  name: aibrix-kvcache-llama-3-8b-instruct
  namespace: default
  annotations:
    kvcache.orchestration.aibrix.ai/pod-affinity-workload: llama-3-8b-instruct
    kvcache.orchestration.aibrix.ai/pod-anti-affinity: "true"
    kvcache.orchestration.aibrix.ai/node-affinity-gpu-type: NVIDIA-L20
spec:
  replicas: 1
  service:
    type: ClusterIP
    port: 9600
  cacheSpec:
    image: aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/vineyardd:20241120
    imagePullPolicy: IfNotPresent
    cpu: "4000m"
    memory: 72Gi
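Because the engine mounts the node-local Vineyard socket through the hostPath /var/run/vineyard-kubernetes/default/aibrix-kvcache-llama-3-8b-instruct, the two pods only work together when they share a node. One possible mitigation (a sketch only; the kvcache pod label used below is hypothetical, I have not checked which labels the controller actually sets) would be a mirrored, soft podAffinity on the engine Deployment so a restarted engine pod prefers the node where the kvcache pod already runs:

```yaml
# Hypothetical addition to the engine Deployment's pod template
# (spec.template.spec.affinity). The label selector assumes the kvcache pods
# carry a "kvcache.orchestration.aibrix.ai/name" label, which is a guess on my
# part. A preferred (soft) rule is used so the first rollout does not deadlock
# with the kvcache pod's own required affinity toward the engine.
affinity:
  podAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            kvcache.orchestration.aibrix.ai/name: aibrix-kvcache-llama-3-8b-instruct
        topologyKey: kubernetes.io/hostname
```

Even then, a hard guarantee probably requires the controller to reconcile the kvcache pod's placement when the engine pod moves, which is what this issue is asking about.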
Expected behavior
The kvcache pod and the engine pod stay on the same node even after the engine pod is restarted.
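For reference, colocation can be verified by comparing the nodes reported for the two pods (the selector comes from the Deployment above; I am assuming pods created by the KVCache CR carry its name as a prefix):

```sh
# Node of the engine pod, selected by the model label from the Deployment
kubectl get pods -l model.aibrix.ai/name=llama-3-8b-instruct -o wide
# Node of the kvcache pod (assuming it is named after the KVCache CR)
kubectl get pods -o wide | grep aibrix-kvcache-llama-3-8b-instruct
```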
Environment
main branch
Did the attempt succeed?