
Can I deploy the service using Lorax without using lorax-launcher to start?

Open Nipi64310 opened this issue 1 year ago • 2 comments

Can I deploy the service using LoRAX without using lorax-launcher to start it, and instead load the model directly in code?

Similar to HF and vLLM, where I can use code like the following to load the model:

# vllm
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

engine_args = AsyncEngineArgs(model="model_id")
model = AsyncLLMEngine.from_engine_args(engine_args)


# hf
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(
    "model_id",
    torch_dtype="auto",
    device_map="auto"
)

Nipi64310 avatar Mar 06 '24 07:03 Nipi64310

Hi @Nipi64310, yes, this is possible. @alexsherstinsky has an example notebook demonstrating how to do this that he can share. He's also planning on adding this to the lorax-client so it can be done natively in LoRAX like:

from lorax import Client, Server

server = Server(port=8080)

client = Client(server.base_url)
client.generate(...)

tgaddair avatar Mar 10 '24 21:03 tgaddair

Hello, @Nipi64310! Please take a look at this Google Colab notebook, and before doing anything with it, please read the following explanations, caveats, and our plans.

For a quick background, LoRAX is a client-server mechanism for serving LLMs with fine-tuned adapters at scale. The LoRAX server is started with a given base model and listens on a host/port. LoRAX clients connect to the server on that host/port, send text prompts to it, and receive the generated responses. If a client specifies an adapter_id, the response comes from the fine-tuned model identified by that adapter_id; otherwise, the response comes from the base model. Thus, a large number of fine-tuned adapters that share the same base model can be prompted against a single LoRAX server concurrently.
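For concreteness, here is a minimal sketch of that dispatch using the lorax-client Python package; the server URL and the adapter ID below are hypothetical placeholders, not values from the notebook:

# Minimal sketch: one LoRAX server, many adapters, selected per request.
from lorax import Client

client = Client("http://127.0.0.1:8080")  # host/port of the running server

# Without adapter_id, the base model generates the response.
base = client.generate("Write a headline for this article: ...")

# With adapter_id, the named fine-tuned adapter (hypothetical ID here)
# generates the response -- served by the very same server process.
tuned = client.generate(
    "Write a headline for this article: ...",
    adapter_id="my-org/news-headline-lora",
)

print(base.generated_text)
print(tuned.generated_text)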

The above-linked notebook is an educational example and still a work in progress (reasons below).

The goal of this example is to demonstrate fine-tuning several adapters with Ludwig and serving them all using LoRAX, with everything in a single notebook so that the reader can see all the moving parts together instead of toggling between multiple terminal shells, notebooks, and other screens.

Since this notebook contains both Ludwig and LoRAX examples, you are welcome to skip the Ludwig portion and focus only on the LoRAX parts.

The flow of this notebook has the following principal parts:

  1. Dataset preparation.
  2. Starting the LoRAX server for a base model (we used GPT-2 in order to fit within the memory constraints of the GPU that comes with the free Google Colab account).
  3. Prompting the LoRAX server -- only the base model at first (to see how it performs without any fine-tuning). Please note that we do this in two different places, because each fine-tuning experiment has its own prompt.
  4. Fine-tuning the base model on the given dataset with Ludwig and saving the adapter (here we save it to HuggingFace, but this is not strictly necessary -- Ludwig saves the model to the local filesystem, and LoRAX can look it up locally as well). We do this twice, once for each adapter use case: one use case generates a news article from its headline; the second is the reverse -- summarizing an article's content to generate the headline. A sketch of this step follows the list.
  5. Prompting the LoRAX server with the adapter_id argument and verifying that the quality of the result is better than that of the base model. Again, we do this twice, because we have two different adapters.
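As a rough illustration of step 4, here is a hedged sketch of a Ludwig LLM fine-tuning run; the feature names, dataset file, and config values are hypothetical placeholders rather than the notebook's exact configuration:

# Hedged sketch of fine-tuning a LoRA adapter with Ludwig (not the exact
# notebook code; feature names and the dataset file are placeholders).
from ludwig.api import LudwigModel

config = {
    "model_type": "llm",
    "base_model": "gpt2",             # same base model the LoRAX server runs
    "adapter": {"type": "lora"},      # train a LoRA adapter, not full weights
    "input_features": [{"name": "headline", "type": "text"}],
    "output_features": [{"name": "article", "type": "text"}],
    "trainer": {"type": "finetune", "epochs": 3},
}

model = LudwigModel(config=config)
model.train(dataset="news.csv")       # hypothetical dataset file

# The trained adapter is saved to the local filesystem under Ludwig's
# results directory; from there it can optionally be pushed to HuggingFace.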

And now, to the essential mechanics of launching the LoRAX server in this notebook. For production purposes, the LoRAX server should be launched on its own machine, with a GPU that has sufficient memory for the given base model and the expected number of adapters; the purpose of this notebook, however, is strictly educational. Hence, it was critical to be able to launch the LoRAX server from a notebook cell in the background (i.e., so that control returns to the notebook), while preserving the ability to see the important informational and/or error messages from the LoRAX server as it starts up. If the LoRAX server could not be started in the background, the launching cell would keep running indefinitely, preventing us from running any other cell in the notebook.

Launching the LoRAX server in the background is accomplished with the Python subprocess library. The LoRAX server itself ships as a Docker container (you will see the code in the notebook that downloads it). We also make use of udocker, a Python-based Docker emulator that supports only a subset of Docker's container run commands. Thus, we execute udocker to run the LoRAX Docker container with the usual Docker parameters, also passing the environment variables through to the LoRAX server itself. Because detached ("daemon") mode is not among the Docker commands that udocker supports, we had to implement this capability ourselves; hence the use of the subprocess library mentioned above, which starts the given shell command asynchronously.
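In code, the background launch boils down to something like the following sketch; the container name, image flags, ports, and environment variables are illustrative placeholders, not the notebook's exact command line:

# Hedged sketch of launching the LoRAX container in the background via
# udocker; the names and flags below are illustrative, not exact.
import subprocess

cmd = (
    "udocker --allow-root run "
    "-p 8080:80 "                          # map host port to the server port
    "-e HUGGING_FACE_HUB_TOKEN=<token> "   # env vars forwarded to the server
    "lorax-container "                     # placeholder container name
    "--model-id gpt2"                      # base model for the LoRAX server
)

# Popen returns immediately (our hand-rolled "detached mode"), so the cell
# finishes while the server keeps booting; piping stderr into stdout lets us
# stream startup/info/error messages from another cell if needed.
server_process = subprocess.Popen(
    cmd,
    shell=True,
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
)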

Once the LoRAX server launch command finishes (it may take in excess of twenty minutes to "warm up" and start a large base model) and the notebook cell returns, we can instantiate the LoRAX Python client and interact with the server. The notebook provides examples of this for the two prompts, both without and with the fine-tuned adapters.
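A hedged sketch of this wait-then-connect step follows; it assumes the server exposes an HTTP /health route (as text-generation-inference-style servers do), and the URL is a placeholder:

# Hedged sketch: poll until the server answers, then create the client.
import time
import requests
from lorax import Client

base_url = "http://127.0.0.1:8080"

# A large base model can take many minutes to load, so keep polling.
while True:
    try:
        requests.get(f"{base_url}/health", timeout=5).raise_for_status()
        break
    except requests.RequestException:
        time.sleep(15)

client = Client(base_url)
print(client.generate("Hello, LoRAX!").generated_text)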

And now, a critical note explaining why this is still a work in progress. While I was able to run this notebook in Google Colab a couple of times (as evidenced by the output retained in the shared notebook), Google deems running a server in a notebook to be a violation of its usage policies and may block access to Colab if it is detected. This happened to me, and I am currently corresponding with Google support by email, explaining that this code is for educational purposes only and that no harm or abuse of any kind is intended. While the Google support team is responding, the turnaround time is several business days, and it is quite possible that, given the size of the company, they will not change their policy in the end. For these reasons, we are currently exploring other options for presenting this tutorial in a single notebook.

First, if you have an environment with a GPU where you can run Jupyter, you can already save this notebook in the Jupyter format and run it on that machine. To connect to the Jupyter server from a client machine, you would need to run a port-forwarding command. We already have this working on our experimentation environment at Predibase (which is built out of Kubernetes pods); in fact, I built this notebook in that environment first and only then ported it to Google Colab. Second, we are going to look at other mechanisms, such as SkyPilot and RunPod -- the idea being that while the setup and sharing may be somewhat more complex than having everything in a Colab notebook (where no setup is needed whatsoever and sharing works like any Google document), there will be no usage restrictions. I am happy to keep you posted.

Please let me know if you have any comments or questions regarding this description and the included notebook. I would be more than happy to jump on a Google Meet call with you for a faster discussion. Thank you very much for your interest!

P.S.: Update on 4/26/2024 (but factual as of 3/31/2024): The above notebook is now self-contained (much improved from its "experimental" status). Having said that, the work of folding this way of launching LoRAX into the present repository still remains to be done (and is planned to start in the relatively near term). Please follow this repository for updates. Thank you!

alexsherstinsky avatar Mar 11 '24 16:03 alexsherstinsky