
Unable to load/run LoRA adapters on Llama-2-7B



System Info

TGI Versions:

  • 2.4.0: Deployment fails with the error:
    ERROR text_generation_launcher: Error when initializing model
  • 2.1.1: Deployment succeeds, but curl requests fail with the error:
    ERROR text_generation_launcher: Method Prefill encountered an error

Both of the following LoRA adapters exhibit the same issue on both versions:

  1. vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm
  2. xtuner/Llama-2-7b-qlora-moss-003-sft

Attached is the error log archive.


Configuration Details

  • Platform: Docker
  • Execution Method: Command Line Interface (CLI); an equivalent docker run invocation is sketched below
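
For reference, a minimal sketch of the same setup outside Kubernetes, assuming a local GPU host; the host port, volume path, and HF_TOKEN variable are illustrative and not part of the original report:

# Illustrative single-host equivalent of the GKE deployment below
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/data:/data \
  -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
  ghcr.io/huggingface/text-generation-inference:2.4.0 \
  --model-id meta-llama/Llama-2-7b-hf \
  --num-shard 1 \
  --max-concurrent-requests 128 \
  --lora-adapters vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm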

Deployment Setup

This YAML configuration was deployed on a single A100-40GB node (a2-highgpu-2g) within GKE:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tgi
  labels:
    app: tgi
spec:
  selector:
    matchLabels:
      app: tgi
  template:
    metadata:
      labels:
        app: tgi
        ai.gke.io/inference-server: text-generation-inference
        examples.ai.gke.io/source: ai-on-gke-benchmarks
    spec:
      volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 1Gi
        - name: data
          hostPath:
            path: /mnt/stateful_partition/kube-ephemeral-ssd/data
      containers:
        - name: text-generation-inference
          ports:
            - containerPort: 80
          image: "ghcr.io/huggingface/text-generation-inference:2.4.0"  # or "ghcr.io/huggingface/text-generation-inference:2.1.1"
          args: ["--model-id", "meta-llama/Llama-2-7b-hf", "--num-shard", "1", "--max-concurrent-requests", "128", "--lora-adapters", "vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm"]
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
          resources:
            limits:
              nvidia.com/gpu: 1
            requests:
              cpu: "1"
          volumeMounts:
            - mountPath: /dev/shm
              name: dshm
            - mountPath: /data
              name: data
          # Gate startup on a successful /generate request that exercises the LoRA adapter
          startupProbe:
            failureThreshold: 3600
            periodSeconds: 10
            timeoutSeconds: 60
            exec:
              command:
                - /usr/bin/curl
                - http://localhost:80/generate
                - -X
                - POST
                - -d
                - '{"inputs":"test", "parameters":{"max_new_tokens":1, "adapter_id" : "vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm"}}'
                - -H
                - 'Content-Type: application/json'
                - --fail
---
apiVersion: v1
kind: Service
metadata:
  name: tgi
  labels:
    app: tgi
spec:
  type: LoadBalancer
  ports:
    - port: 80
      targetPort: 80
      protocol: TCP
  selector:
    app: tgi
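
Once the pod is running (on 2.1.1, where startup succeeds), the Prefill error can be reproduced with a request like the following; <EXTERNAL_IP> is a placeholder for whatever address the LoadBalancer service is assigned:

# Same request the startup probe issues, sent to the external service
curl http://<EXTERNAL_IP>:80/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "test", "parameters": {"max_new_tokens": 1, "adapter_id": "vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm"}}' \
  --fail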

Expected Behavior

Deployment, startup, and inference requests should all succeed without errors, as the same LoRA adapters are known to work with vLLM and LoRAX.


kaushikmitr · Nov 05 '24 23:11