Does it support speculative decoding with a draft model that is not an ngram?

Open libin817927 opened this issue 5 months ago • 2 comments

Does it support speculative decoding with a draft model that is not an ngram?

If it is supported, how should the YAML be configured, and is there any corresponding documentation?

libin817927 avatar Jul 25 '25 07:07 libin817927

@libin817927 We have not enabled this case yet. Could you give more details on how it is deployed at the moment? Probably with a naive approach?

Jeffwan avatar Jul 28 '25 20:07 Jeffwan

@libin817927, here is an example:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    model.aibrix.ai/name: facebook-opt-6
    model.aibrix.ai/port: "8000"
  name: facebook-opt-6
  namespace: default
spec:
  strategy:
    type: Recreate
  replicas: 1
  selector:
    matchLabels:
      model.aibrix.ai/name: facebook-opt-6
  template:
    metadata:
      labels:
        model.aibrix.ai/name: facebook-opt-6
    spec:
      nodeSelector:
        workload: decode
      containers:
        - command:
            - python3
            - -m
            - vllm.entrypoints.openai.api_server
            - --host
            - "0.0.0.0"
            - --port
            - "8000"
            - --uvicorn-log-level
            - warning
            - --model
            - facebook/opt-6.7b
            - --served-model-name
            - facebook-opt-6
            - --speculative-config
            - '{"model": "facebook/opt-125m", "num_speculative_tokens": 5}'
          image: vllm/vllm-openai:v0.9.2
          imagePullPolicy: IfNotPresent
          name: vllm-openai
          env:
            - name: HF_TOKEN
              value: hf_something  # placeholder, replace with your own token
            - name: VLLM_LOGGING_LEVEL
              value: "DEBUG"
            - name: PYTORCH_CUDA_ALLOC_CONF
              value: "expandable_segments:True"
            - name: NCCL_DEBUG
              value: "WARN"
          ports:
            - containerPort: 8000
              protocol: TCP
          resources:
            limits:
              nvidia.com/gpu: "1"
            requests:
              nvidia.com/gpu: "1"
---

apiVersion: v1
kind: Service
metadata:
  labels:
    model.aibrix.ai/name: facebook-opt-6
    prometheus-discovery: "true"
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
  name: facebook-opt-6
  namespace: default
spec:
  ports:
    - name: serve
      port: 8000
      protocol: TCP
      targetPort: 8000
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 8080
  selector:
    model.aibrix.ai/name: facebook-opt-6
  type: ClusterIP
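
To check that the deployment answers requests, something like the following minimal sketch could be used. It assumes the Service has been made reachable locally (for example with `kubectl port-forward svc/facebook-opt-6 8000:8000`) and that the request goes to vLLM's OpenAI-compatible `/v1/completions` endpoint. The helper names `build_completion_request` and `complete`, the `localhost` URL, and the prompt are illustrative and not part of AIBrix or vLLM. Speculative decoding happens on the server side, so the request looks the same as for a deployment without a draft model.

```python
import json
import urllib.request

# Assumption: the Service is port-forwarded to localhost, e.g.
#   kubectl port-forward svc/facebook-opt-6 8000:8000
BASE_URL = "http://localhost:8000"
# Matches --served-model-name in the Deployment above.
MODEL = "facebook-opt-6"


def build_completion_request(prompt, max_tokens=32, base_url=BASE_URL):
    """Build a POST request for the OpenAI-compatible completions endpoint."""
    payload = {"model": MODEL, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, max_tokens=32, base_url=BASE_URL):
    """Send the request and return the generated text (needs a running server)."""
    req = build_completion_request(prompt, max_tokens, base_url)
    with urllib.request.urlopen(req, timeout=60) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]
```

With the port-forward running, `print(complete("San Francisco is a"))` should print a short continuation from `facebook/opt-6.7b`. Nothing in the request refers to the draft model, since it is configured only through `--speculative-config` on the server.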

omerap12 avatar Sep 09 '25 20:09 omerap12