
feat: LoRA loading per request

Open aarnphm opened this issue 2 years ago • 18 comments

Feature request

Add a parameter to load LoRA adapters per request

Motivation

No response

Other

No response

aarnphm avatar Jul 25 '23 16:07 aarnphm

Hey @aarnphm, are you asking for a way to load in an adapter per query for a given base model? For example, say you have facebook/opt-350m running and you're querying it, but you now want to be able to load in an adapter for the model that's already running in the service without creating a fresh openllm build?

If yes, then this is also something I'm looking for, but currently that doesn't seem possible.

arnavgarg1 avatar Aug 03 '23 18:08 arnavgarg1

This is already supported: you can pass it in via --adapter-id during startup and then specify the adapter_name per request.
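
For reference, a rough client-side sketch of what that could look like. The payload shape and the adapter name below are assumptions and may differ between OpenLLM versions; check the server's Swagger UI for the exact schema.

# Rough sketch, not the exact OpenLLM schema. Assumes the server was started with
# something like: openllm start opt --model-id facebook/opt-350m --adapter-id arnavgrg/opt350m_lora_test
import requests

response = requests.post(
    "http://localhost:3000/v1/generate",
    json={
        "prompt": "What is the capital of Italy?",
        # Hypothetical adapter name; it should match whatever was registered via --adapter-id.
        "adapter_name": "arnavgrg/opt350m_lora_test",
    },
    timeout=60,
)
print(response.json())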

aarnphm avatar Aug 03 '23 18:08 aarnphm

But I haven't announced it yet, since I'm still testing it.

I think the more important use case is for users to provide a remote adapter; what would that look like?

since we will most likely have to download it to each runner instance and then load it dynamically. This will increase latency by a lot.

--adapter-id was the first solution for loading all LoRA layers. If you only have a few layers, then it is fine to load everything into memory. However, if you are working with 100+ LoRA layers, this strategy is not feasible.

I'm still doing some research on this. Please let me know if you find something. Feel free to join our Discord, as I will be more active there.

aarnphm avatar Aug 03 '23 18:08 aarnphm

Right, I gave that a try and it looks like that works.

I guess what I was wondering is if it's possible to do the following:

  • At startup time, just load the model without adapters
  • I might at a later time train some task-specific adapters
  • I want to load these adapters into the already-running model, essentially hot-swapping them for each task while reusing the same existing deployment that might already be running. I might incrementally train even more adapters for other downstream tasks, but I don't want to spin up a new service, just reuse what's already running.

arnavgarg1 avatar Aug 03 '23 18:08 arnavgarg1

Hey @aarnphm, are you asking for a way to load in an adapter per query for a given base model? For example, say you have facebook/opt-350m running and you're querying it, but you now want to be able to load in an adapter for the model that's already running in the service without creating a fresh openllm build?

I think openllm build is irrelevant here, since this should be independent of start or build

aarnphm avatar Aug 03 '23 18:08 aarnphm

Right, I gave that a try and it looks like that works.

I guess what I was wondering is if it's possible to do the following:

  • At startup time, just load the model without adapters
  • I might at a later time train some task-specific adapters
  • I want to load these adapters into the already-running model, essentially hot-swapping them for each task while reusing the same existing deployment that might already be running. I might incrementally train even more adapters for other downstream tasks, but I don't want to spin up a new service, just reuse what's already running.

Yes, that is the purpose of this feature: so that you don't have to rebuild the service.

aarnphm avatar Aug 03 '23 18:08 aarnphm

Totally makes sense, and it is super useful if you know the adapters in advance! I guess this doesn't work if you train a new adapter after the service is already deployed, which is what I would love to have. Not sure what it would take to make that happen.

arnavgarg1 avatar Aug 03 '23 18:08 arnavgarg1

The purpose of adapter_name per request here is that after you train a new LoRA layer, you can pass a remote URL to it and then it will just load the layers into the model.

IMO the default behaviour should be: load the LoRA layer -> request -> output -> unload the LoRA layer.

There is probably a case where we just allow the LoRA layer to stay loaded with the model, so that the layer can be reused afterwards. Just note that in this case we might need to somehow manage state (which is very complicated and hard atm, since runners are stateless).
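
As an illustration only (this is not OpenLLM's internal implementation), that per-request flow could look roughly like the sketch below with peft, using the base model and adapter id mentioned in this thread:

# Illustrative sketch of the load -> request -> output -> unload cycle with peft.
# Not OpenLLM's implementation; it only shows the idea.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "facebook/opt-350m"
tokenizer = AutoTokenizer.from_pretrained(base_id)
base_model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float16, device_map="auto")

def generate_with_adapter(prompt: str, adapter_id: str) -> str:
    # Load the LoRA layers for this request (downloads from the Hub on first use).
    model = PeftModel.from_pretrained(base_model, adapter_id)
    inputs = tokenizer(prompt, return_tensors="pt").to(base_model.device)
    output_ids = model.generate(**inputs, max_new_tokens=64)
    text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    # Unload the LoRA layers so the base model is clean for the next request.
    model.unload()
    return text

print(generate_with_adapter("What is the capital of Italy?", "arnavgrg/opt350m_lora_test"))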

aarnphm avatar Aug 03 '23 19:08 aarnphm

I don't think a local path would make sense, since there is no way for the server to resolve it or import it into memory.

aarnphm avatar Aug 03 '23 19:08 aarnphm

The purpose of adapter_name per request here is that after you train a new LoRA layer, you can pass a remote URL to it and then it will just load the layers into the model.

How can I use adapter_name? Is it different from --adapter-id? For example, right now I want to use facebook/opt-350m with a fine-tuned LoRA adapter hosted on the Hugging Face Hub at arnavgrg/opt350m_lora_test, so I'm doing openllm start opt --model-id facebook/opt-350m --adapter-id arnavgrg/opt350m_lora_test. What I want to simulate is openllm start opt --model-id facebook/opt-350m and then hit /v1/adapter with arnavgrg/opt350m_lora_test dynamically for a specific query.

IMO the default behaviour should be: load the LoRA layer -> request -> output -> unload the LoRA layer.

I agree. Parse the request, get the remote URL where the adapter weights live, download the weights, load the LoRA layer, complete the request, return the output. I would probably consider doing the unload step only if a different LoRA layer is passed into the request, but I guess that assumes a single client querying the service and won't work for concurrent requests. So yeah, I guess unloading the LoRA layer should be OK. Typically, loading adapter weights is not very expensive, so I think your proposed pathway sounds good.

I don't think a local path would make sense, since there is no way for the server to resolve it or import it into memory.

Agreed. The weights have to be hosted remotely somewhere so they can be downloaded into the service; local paths can't be supported for that specific case.

arnavgarg1 avatar Aug 03 '23 19:08 arnavgarg1

You can provide the adapter_name via the Swagger UI, at the same level as the prompt key.

aarnphm avatar Aug 03 '23 22:08 aarnphm

In terms of supporting paths, maybe we can add support for S3, but initially we will probably only support loading from the Hugging Face Hub.
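
For the Hub case, the download step on a runner is fairly simple; a rough sketch is below (in practice the repo id would come from the request, and the resolved directory would be handed to whatever loads the LoRA layers):

# Sketch of fetching remote adapter weights from the Hugging Face Hub onto a runner.
from huggingface_hub import snapshot_download

adapter_dir = snapshot_download(repo_id="arnavgrg/opt350m_lora_test")
# adapter_dir should now contain adapter_config.json plus the LoRA weights,
# ready to be loaded with peft (e.g. PeftModel.from_pretrained).
print(adapter_dir)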

aarnphm avatar Aug 03 '23 22:08 aarnphm

Hi @aarnphm, can you clarify this point?

You can provide the adapter_name via the Swagger UI, at the same level as the prompt key.

I've deployed an OPT model using the following script (named service.py):

import openllm
import bentoml
import torch


# Wrap the base model in an OpenLLM runner and expose it through a BentoML service.
llm_runner = openllm.Runner("opt", model_id="facebook/opt-6.7b", torch_dtype=torch.float16, device_map="cuda")
svc = bentoml.Service(name="llm-service", runners=[llm_runner])


@svc.on_startup
def download(_: bentoml.Context):
    # Ensure the model weights are downloaded before serving.
    llm_runner.download_model()


@svc.api(input=bentoml.io.Text(), output=bentoml.io.Text())
async def prompt(input_text: str) -> str:
    answer = await llm_runner.generate.async_run(input_text)
    return answer[0]

This successfully deploys when I run bentoml serve service:svc.

Now, I am trying to load an adapter via a curl command, but I can't seem to get the endpoint right. I'm trying the following, but getting a 404:

$ curl -X POST http://localhost:3000/v1/adapters -H "Content-Type: application/json" -d '{"adapter_name": "aarnphm/opt-6.7b-lora:french_lora"}' -v
Note: Unnecessary use of -X or --request, POST is already inferred.
*   Trying 127.0.0.1:3000...
* Connected to localhost (127.0.0.1) port 3000 (#0)
> POST /v1/adapters HTTP/1.1
> Host: localhost:3000
> User-Agent: curl/7.81.0
> Accept: */*
> Content-Type: application/json
> Content-Length: 53
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 404 Not Found
< date: Thu, 10 Aug 2023 01:22:52 GMT
< server: uvicorn
< content-length: 9
< content-type: text/plain; charset=utf-8
<
* Connection #0 to host localhost left intact

geoffreyangus avatar Aug 10 '23 01:08 geoffreyangus

Oh, I need to update how the LoRA layers are loaded. The /v1/adapters endpoint doesn't make sense because of broadcasting issues with multiple runners.

Can you try passing adapter_name in the request to /v1/generate?

aarnphm avatar Aug 10 '23 02:08 aarnphm

Got it. I'm also trying now using the vanilla openllm CLI tool. Here's how I deployed the model:

openllm start falcon --model-id tiiuae/falcon-7b-instruct

Here's what I submitted:

curl -X POST http://localhost:3000/v1/generate -H "Content-Type: application/json" -d '{"prompt": "What is the capital of Italy?", "adapter_name": "zeynab2021/falcon-7b-instruct-ft-adapters-hotel-qa"}' -v

Here's what I got back:

2023-08-10T02:32:23+0000 [INFO] [cli:llm-falcon-service:31] 127.0.0.1:49066 (scheme=http,method=POST,path=/v1/generate,type=application/json,length=113) (status=500,type=application/json,length=110) 7.566ms (trace=a171d9efd70bcd71191c5b0299d94788,span=b3799cc34b1bf125,sampled=1,service.name=llm-falcon-service)
2023-08-10T02:32:41+0000 [ERROR] [cli:llm-falcon-service:31] Exception on /v1/generate [POST] (trace=42c6423552c9f6a1ecf14547929f6554,span=7cadc91b0a1e9ea1,sampled=1,service.name=llm-falcon-service)
Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/bentoml/_internal/server/http_app.py", line 341, in api_func
    output = await api.func(*args)
  File "/opt/conda/lib/python3.9/site-packages/openllm/_service.py", line 43, in generate_v1
    qa_inputs = openllm.GenerationInput.for_model(model)(**input_dict)
TypeError: __init__() got an unexpected keyword argument 'adapter_name'

This is on openllm==0.2.17

geoffreyangus avatar Aug 10 '23 02:08 geoffreyangus

Got it, I will take a look.

aarnphm avatar Aug 10 '23 18:08 aarnphm

This is now supported via adapter_name and should be fixed with the new implementation in 0.4.

aarnphm avatar Nov 07 '23 22:11 aarnphm

For loading LoRA layers, I'm thinking we might need to figure out how to handle broadcasting in a distributed environment, i.e. k8s.

aarnphm avatar Nov 07 '23 22:11 aarnphm

Closing as part of the OpenLLM v0.6 refactoring.

bojiang avatar Jul 13 '24 05:07 bojiang