aibrix
Does it support speculative decoding with a draft model that is not an ngram?
If it is supported, how should the yaml be configured and is there any corresponding documentation?
@libin817927 We have not enabled that case yet. Could you give more details on how it is deployed at the moment? Probably a naive approach.
@libin817927 , here is an example:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    model.aibrix.ai/name: facebook-opt-6
    model.aibrix.ai/port: "8000"
  name: facebook-opt-6
  namespace: default
spec:
  strategy:
    type: Recreate
  replicas: 1
  selector:
    matchLabels:
      model.aibrix.ai/name: facebook-opt-6
  template:
    metadata:
      labels:
        model.aibrix.ai/name: facebook-opt-6
    spec:
      nodeSelector:
        workload: decode
      containers:
        - command:
            - python3
            - -m
            - vllm.entrypoints.openai.api_server
            - --host
            - "0.0.0.0"
            - --port
            - "8000"
            - --uvicorn-log-level
            - warning
            - --model
            - facebook/opt-6.7b
            - --served-model-name
            - facebook-opt-6
            - --speculative-config
            - '{"model": "facebook/opt-125m", "num_speculative_tokens": 5}'
          image: vllm/vllm-openai:v0.9.2
          imagePullPolicy: IfNotPresent
          name: vllm-openai
          env:
            - name: HF_TOKEN
              value: hf_something
            - name: VLLM_LOGGING_LEVEL
              value: "DEBUG"
            - name: PYTORCH_CUDA_ALLOC_CONF
              value: "expandable_segments:True"
            - name: NCCL_DEBUG
              value: "WARN"
          ports:
            - containerPort: 8000
              protocol: TCP
          resources:
            limits:
              nvidia.com/gpu: "1"
            requests:
              nvidia.com/gpu: "1"
---
apiVersion: v1
kind: Service
metadata:
  labels:
    model.aibrix.ai/name: facebook-opt-6
    prometheus-discovery: "true"
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
  name: facebook-opt-6
  namespace: default
spec:
  ports:
    - name: serve
      port: 8000
      protocol: TCP
      targetPort: 8000
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 8080
  selector:
    model.aibrix.ai/name: facebook-opt-6
  type: ClusterIP
```