Unable to load/run LoRA adapters on Llama-2-7B
System Info
TGI Versions:
- 2.4.0: Deployment fails with the error:
  ERROR text_generation_launcher: Error when initializing model
- 2.1.1: Deployment succeeds, but curl requests fail with the error:
  ERROR text_generation_launcher: Method Prefill encountered an error
Both of the following LoRA adapters exhibit the same issue across versions:
- vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm
- xtuner/Llama-2-7b-qlora-moss-003-sft
Attached is the error log archive.
Configuration Details
- Platform: Docker
- Execution Method: Command Line Interface (CLI)
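Since the setup runs TGI via the Docker CLI, here is a minimal sketch of the standalone invocation equivalent to the Kubernetes manifest below; the port mapping, volume path, and token value are illustrative placeholders, while the image tag, model ID, and adapter ID are the ones from this report:

# Swap the image tag for 2.1.1 to hit the Prefill error instead of the init failure.
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/data:/data \
  -e HUGGING_FACE_HUB_TOKEN=<hf-token> \
  ghcr.io/huggingface/text-generation-inference:2.4.0 \
  --model-id meta-llama/Llama-2-7b-hf \
  --num-shard 1 \
  --max-concurrent-requests 128 \
  --lora-adapters vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm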
Deployment Setup
This YAML configuration was deployed on a single A100-40GB node (a2-highgpu-2g) within GKE:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tgi
  labels:
    app: tgi
spec:
  selector:
    matchLabels:
      app: tgi
  template:
    metadata:
      labels:
        app: tgi
        ai.gke.io/inference-server: text-generation-inference
        examples.ai.gke.io/source: ai-on-gke-benchmarks
    spec:
      volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 1Gi
        - name: data
          hostPath:
            path: /mnt/stateful_partition/kube-ephemeral-ssd/data
      containers:
        - name: text-generation-inference
          ports:
            - containerPort: 80
          image: "ghcr.io/huggingface/text-generation-inference:2.4.0" # or "ghcr.io/huggingface/text-generation-inference:2.1.1"
          args: ["--model-id", "meta-llama/Llama-2-7b-hf", "--num-shard", "1", "--max-concurrent-requests", "128", "--lora-adapters", "vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm"]
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
          resources:
            limits:
              nvidia.com/gpu: 1
            requests:
              cpu: "1"
          volumeMounts:
            - mountPath: /dev/shm
              name: dshm
            - mountPath: /data
              name: data
          startupProbe:
            failureThreshold: 3600
            periodSeconds: 10
            timeoutSeconds: 60
            exec:
              command:
                - /usr/bin/curl
                - http://localhost:80/generate
                - -X
                - POST
                - -d
                - '{"inputs":"test", "parameters":{"max_new_tokens":1, "adapter_id" : "vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm"}}'
                - -H
                - 'Content-Type: application/json'
                - --fail
---
apiVersion: v1
kind: Service
metadata:
  name: tgi
  labels:
    app: tgi
spec:
  type: LoadBalancer
  ports:
    - port: 80
      targetPort: 80
      protocol: TCP
  selector:
    app: tgi
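The request issued by the startup probe can also be sent manually; on 2.1.1 this is the request that triggers the Prefill error. A minimal sketch, assuming the service's LoadBalancer endpoint is reachable (EXTERNAL_IP is a placeholder):

# Mirrors the startup probe request, sent against the exposed service.
curl http://EXTERNAL_IP:80/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs":"test", "parameters":{"max_new_tokens":1, "adapter_id":"vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm"}}' \
  --fail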
Expected Behavior
Deployment, startup, and inference requests should all succeed without errors; the same LoRA adapters are known to work with both vLLM and LoRAX.
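For comparison, a minimal sketch of serving the same base model and adapter with vLLM's OpenAI-compatible server, assuming its --enable-lora and --lora-modules flags; the adapter alias tweet-summ is an arbitrary name chosen here, not part of the original setup:

# Serve the base model with the LoRA adapter registered under the alias tweet-summ.
vllm serve meta-llama/Llama-2-7b-hf \
  --enable-lora \
  --lora-modules tweet-summ=vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm

Requests then select the adapter by passing the tweet-summ alias as the model name in the completions API.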