Throughput and Latency degradation with a single LoRA adapter on A100 40 GB
System Info
Setup Summary for LoRAX Benchmarking with the Llama-2 Model:
- Hardware: A100 40 GB (a2-highgpu-2g) on Google Kubernetes Engine (GKE)
- Image: ghcr.io/predibase/lorax:latest
- Model: meta-llama/Llama-2-7b-hf
- GPU Count: 1
- Experiments:
  - Experiment 1: requests against the base model meta-llama/Llama-2-7b-hf only.
  - Experiment 2: LoRAX deployed with the LoRA adapter vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm (size 160 MB).
  - Experiment 3: LoRAX deployed with the LoRA adapter xtuner/Llama-2-7b-qlora-moss-003-sft (size 640 MB).
Each experiment ran for about 100 seconds for each QPS value. For all three experiments, we used the same input prompt (ShareGPT) and observed a similar output length.
Benchmark Metrics: We measured:
- Latency per output token
- Throughput (output tokens per second)
You can view detailed results in the benchmark document: Benchmark 1 server - LoRAX.pdf
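Roughly, the two metrics can be computed from per-request samples as in the sketch below. This is illustrative only: the definitions here (mean per-request latency divided by that request's output length, and total output tokens over wall-clock time) are my paraphrase of the metric names, not necessarily the exact formulas used for the PDF.

```python
# Sketch: deriving the two reported metrics from per-request measurements.
# Each sample is (end_to_end_latency_seconds, output_token_count); wall_clock_s
# is the duration of the run (~100 s per QPS value).

def latency_per_output_token(samples: list[tuple[float, int]]) -> float:
    """Mean over requests of (request latency / that request's output length)."""
    per_token = [lat / n for lat, n in samples if n > 0]
    return sum(per_token) / len(per_token)

def output_throughput(samples: list[tuple[float, int]], wall_clock_s: float) -> float:
    """Total output tokens generated across all requests, per second of wall clock."""
    return sum(n for _, n in samples) / wall_clock_s

# Tiny example: 3 requests completed during a 100-second run
samples = [(2.0, 100), (2.5, 120), (1.8, 90)]
print(latency_per_output_token(samples))   # ~0.0203 s per output token
print(output_throughput(samples, 100.0))   # 3.1 output tokens per second
```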
Observations and Questions:
- Using LoRA adapters led to a notable degradation in throughput and latency compared to the base model: we observed up to a 50% drop in maximum throughput when serving through a LoRA adapter.
- Is this performance degradation expected with LoRA adapters?
- Are there parameters or tuning options that could improve LoRA performance?
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
Sample Query:
```bash
curl -i ${IP}:${PORT}/generate -X POST -d '{
    "inputs": "[INST] Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? [/INST]",
    "parameters": {
        "max_new_tokens": 10,
        "adapter_id": "vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm"
    }
}' -H 'Content-Type: application/json'
```
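The benchmark driver itself is not included here; below is a minimal sketch of what one run per QPS value looks like. The specifics are assumptions on my part (aiohttp as the HTTP client, a fixed max_new_tokens, and the TGI-style "details": true parameter so the server reports the generated token count); Experiment 1 omits adapter_id, while Experiments 2 and 3 pass their respective adapter IDs.

```python
# Minimal sketch of one benchmark run: issue requests at a fixed QPS for ~100 s
# against the LoRAX /generate endpoint and collect per-request timings.
import asyncio
import time

import aiohttp

URL = "http://<EXTERNAL_IP>:8000/generate"      # LoadBalancer IP of the Service below
PROMPT = "[INST] ...one of the ShareGPT prompts... [/INST]"

async def one_request(session, payload, samples):
    t0 = time.perf_counter()
    async with session.post(URL, json=payload) as resp:
        body = await resp.json()
    latency = time.perf_counter() - t0
    # Assumes the TGI-style details object; falls back to 0 if absent.
    n_out = body.get("details", {}).get("generated_tokens", 0)
    samples.append((latency, n_out))

async def run(qps: float, adapter_id: str | None = None, duration_s: float = 100.0):
    parameters = {"max_new_tokens": 128, "details": True}    # output budget: assumption
    if adapter_id is not None:                               # Experiments 2 and 3 only
        parameters["adapter_id"] = adapter_id
    payload = {"inputs": PROMPT, "parameters": parameters}

    samples, tasks = [], []
    async with aiohttp.ClientSession() as session:
        start = time.perf_counter()
        while time.perf_counter() - start < duration_s:
            tasks.append(asyncio.create_task(one_request(session, payload, samples)))
            await asyncio.sleep(1.0 / qps)                   # open-loop arrivals at the target QPS
        await asyncio.gather(*tasks)
        wall = time.perf_counter() - start
    return samples, wall

# Experiment 1: asyncio.run(run(qps=2.0))
# Experiment 2: asyncio.run(run(qps=2.0, adapter_id="vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm"))
# Experiment 3: asyncio.run(run(qps=2.0, adapter_id="xtuner/Llama-2-7b-qlora-moss-003-sft"))
```

The per-run metrics are then computed from `samples` and `wall` as in the sketch under Benchmark Metrics above.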
Deployment YAML Configuration:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: lorax-llama2-7b-pool
spec:
  selector:
    app: lorax-llama2-7b-pool
  ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000
  type: LoadBalancer
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: lorax-llama2-7b-pool
spec:
  replicas: 1
  selector:
    matchLabels:
      app: lorax-llama2-7b-pool
  template:
    metadata:
      labels:
        app: lorax-llama2-7b-pool
    spec:
      containers:
        - name: lora
          image: "ghcr.io/predibase/lorax:latest"
          imagePullPolicy: Always
          #command: ["python3", "-m", "lorax.entrypoints.openai.api_server"]
          args:
            - "--model-id"
            - "meta-llama/Llama-2-7b-hf"
          env:
            - name: PORT
              value: "8000"
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
          ports:
            - containerPort: 8000
              name: http
              protocol: TCP
          livenessProbe:
            failureThreshold: 240
            httpGet:
              path: /health
              port: http
              scheme: HTTP
            initialDelaySeconds: 5
            periodSeconds: 5
            successThreshold: 1
            timeoutSeconds: 1
          readinessProbe:
            failureThreshold: 600
            httpGet:
              path: /health
              port: http
              scheme: HTTP
            initialDelaySeconds: 5
            periodSeconds: 5
            successThreshold: 1
            timeoutSeconds: 1
          resources:
            limits:
              nvidia.com/gpu: 1
            requests:
              nvidia.com/gpu: 1
          volumeMounts:
            - mountPath: /data
              name: data
            - mountPath: /dev/shm
              name: shm
      restartPolicy: Always
      schedulerName: default-scheduler
      terminationGracePeriodSeconds: 30
      volumes:
        - name: data
          emptyDir: {}
        - name: shm
          emptyDir:
            medium: Memory
        - name: adapters
          emptyDir: {}
---
```
Expected Behavior
After reviewing the LoRAX blog, particularly the statement:
"Processing 1M tokens spread evenly across 32 different fine-tuned models takes just about as much time as processing the same number of tokens for 1 fine-tuned model due to the near-optimal multi-adapter batching throughput associated with LoRAX."
I anticipated a smaller degradation in latency and throughput when using LoRA adapters. Could you please clarify how the savings in Figure 1 were calculated? It would be helpful to understand whether this level of performance degradation is typical and whether there are specific tuning options that might help mitigate it. Thank you for your guidance.