aibrix: Support different LoRA adapter artifact registries

🚀 Feature Description and Motivation

apiVersion: model.aibrix.ai/v1alpha1
kind: ModelAdapter
metadata:
  name: text2sql-lora-1
  namespace: default
spec:
  baseModel: llama2-70b
  podSelector:
    matchLabels:
      model.aibrix.ai: llama2-70b
  additionalConfig:
    # could be model artifact etc.
    modelArtifact: yard1/llama-2-7b-sql-lora-test
  schedulerName: default-model-adapter-scheduler

Currently, we just use a simple modelArtifact key. This is too limited: it only supports two "registries":

local: an absolute path that must be accessible from the container
Hugging Face repo ID

This is not enough. We need a better structure to support more artifact registries and their associated authN credentials.

Use Case

Support S3 or other registries

Proposed Solution

No response

Jul 27 '24 07:07 Jeffwan

There are two options.

Implement the support on the engine side: we pass everything to the inference engine.
Have the runtime pick up the artifact and download it. This requires changing the control flow so that the LoRA registration is invoked with an absolute path once the adapter has been downloaded.

This is probably not the most critical task for rc1; we will consider postponing it to the next rc release.

Sep 06 '24 20:09 Jeffwan

// ArtifactURL is the address of the model artifact to be downloaded. Different protocols are supported, such as s3, gcs, and huggingface.
// +kubebuilder:validation:Required
ArtifactURL string `json:"artifactURL,omitempty"`

// CredentialsSecretRef points to the secret used to authenticate the artifact download requests
// +optional
CredentialsSecretRef *corev1.LocalObjectReference `json:"credentialsSecretRef,omitempty"`
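As a rough illustration of how a runtime might interpret the proposed ArtifactURL field, here is a minimal Python sketch that classifies a URL by scheme. The names `ArtifactLocation`, `parse_artifact_url`, and `KNOWN_SCHEMES` are hypothetical, and the `scheme://bucket-or-org/path` layout is an assumption; the thread only names s3, gcs, huggingface, and absolute local paths.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Assumed scheme names, taken from the ArtifactURL comment above.
KNOWN_SCHEMES = {"s3", "gcs", "huggingface"}


@dataclass(frozen=True)
class ArtifactLocation:
    registry: str  # "s3", "gcs", "huggingface", or "local"
    bucket: str    # bucket or organization name; empty for local paths
    path: str      # object key, repo name, or absolute file path


def parse_artifact_url(url: str) -> ArtifactLocation:
    """Split an artifact URL into registry, bucket/org, and path."""
    # An absolute path is treated as a local artifact.
    if url.startswith("/"):
        return ArtifactLocation("local", "", url)
    parsed = urlparse(url)
    if parsed.scheme not in KNOWN_SCHEMES:
        raise ValueError(f"unsupported artifact URL: {url!r}")
    return ArtifactLocation(parsed.scheme, parsed.netloc, parsed.path.lstrip("/"))
```

In this sketch a bare Hugging Face repo ID without a scheme is rejected; whether the real field should keep accepting bare repo IDs is a design choice the thread leaves open.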

Another challenge is how a LoRA secretRef can be used by an existing pod. The container in the target pod needs to consume the secretRef, but this cannot be done at runtime. This would be a blocker.

Sep 06 '24 22:09 Jeffwan

Since this involves an open design question, we cannot finish this story by RC1. It can be moved to RC2 instead.

Sep 09 '24 17:09 Jeffwan

Design considerations

The engine starts with some credentials and then loads the LoRA. If the LoRA requires a credential to be downloaded, it has to be the credential we gave to the engine.
Load_lora can accept environment variables, and the engine uses those credentials to download the model.
The engine only handles absolute local paths, and the LoRA is downloaded by the runtime. Once scheduling is done, the controller triggers the runtime model download; once the download is complete (how can the controller learn that with low latency?), the controller triggers the model weights loading.
The engine just ships the task to the runtime, and the runtime takes care of both the downloading and the loading operations.
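The download-then-load ordering in the last two options could be sketched roughly as below. This is only an illustration under stated assumptions: `download_then_load`, the `download` and `load` callables, and the `/tmp/lora-cache` default are all hypothetical stand-ins, not existing aibrix or engine APIs.

```python
import os
from typing import Callable


def download_then_load(
    lora_name: str,
    artifact_url: str,
    download: Callable[[str, str], str],  # (url, target_dir) -> downloaded dir
    load: Callable[[str, str], None],     # (lora_name, absolute_path) -> None
    cache_dir: str = "/tmp/lora-cache",   # assumed cache location
) -> str:
    """Download the adapter first, then register it by absolute local path."""
    target_dir = os.path.join(cache_dir, lora_name)
    local_path = os.path.abspath(download(artifact_url, target_dir))
    # Only register the adapter once the files are actually on disk.
    if not os.path.isdir(local_path):
        raise FileNotFoundError(f"adapter not found at {local_path}")
    load(lora_name, local_path)
    return local_path
```

Because the two steps run in one call, the caller learns that the download finished as soon as the function returns, which sidesteps the polling question raised above at the cost of a longer blocking request.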

Note:

We cannot rely on pod-level credentials, since they are immutable and there is no way to set up everything upfront.

Are there any security concerns with passing through the token or credentials?

/cc @brosoul Since this task involves the runtime interaction, please help check it.

I will move this story to RC3; it is hard to deliver in RC2.

Sep 23 '24 20:09 Jeffwan

Change the target to v0.2.0 instead.

Oct 01 '24 23:10 Jeffwan

vLLM side

curl -X POST http://localhost:8000/v1/load_lora_adapter \
     -H "Content-Type: application/json" \
     -d '{"lora_name": "text2sql-lora-1", "lora_path": "bharati2324/Qwen2.5-1.5B-Instruct-Code-LoRA-r16v2"}'

curl -X POST http://localhost:8000/v1/unload_lora_adapter \
     -H "Content-Type: application/json" \
     -d '{"lora_name": "text2sql-lora-1"}'
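For scripted use, the two curl calls above can be expressed with the Python standard library. This is a minimal sketch: the helper name `build_lora_request` is hypothetical, while the endpoint paths and JSON fields are copied from the curl commands.

```python
import json
from typing import Optional
from urllib.request import Request


def build_lora_request(
    base_url: str,
    action: str,  # "load" or "unload"
    lora_name: str,
    lora_path: Optional[str] = None,
) -> Request:
    """Build the POST request for /v1/load_lora_adapter or /v1/unload_lora_adapter."""
    if action not in ("load", "unload"):
        raise ValueError(f"unknown action: {action!r}")
    payload = {"lora_name": lora_name}
    if action == "load":
        if lora_path is None:
            raise ValueError("lora_path is required when loading an adapter")
        payload["lora_path"] = lora_path
    return Request(
        url=f"{base_url.rstrip('/')}/v1/{action}_lora_adapter",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` sends it, which requires a server listening on the given base URL.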

model management

curl -X POST http://localhost:8080/v1/lora_adapter/load \
     -H "Content-Type: application/json" \
     -d '{"lora_name": "text2sql-lora-1", "lora_path": "bharati2324/Qwen2.5-1.5B-Instruct-Code-LoRA-r16v2"}'

curl -X POST http://localhost:8080/v1/lora_adapter/unload \
     -H "Content-Type: application/json" \
     -d '{"lora_name": "text2sql-lora-1"}'


curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "text2sql-lora-2",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
    }'

Note: the LoRA model list change was released in v0.6.2; later releases have a regression.

Jan 20 '25 02:01 Jeffwan

Testing

Update the controller manager settings:

- --enable-runtime-sidecar

Rebuild controller-manager and runtime.

Jan 20 '25 11:01 Jeffwan

This cannot be closed even with #580. We didn't handle the orchestration of model download + model registration. Currently, it is still a single step.

Jan 21 '25 00:01 Jeffwan

Absolute paths are now supported, so we can mount a PVC. I will postpone the artifact download part to a future release.

Feb 05 '25 23:02 Jeffwan