bumblebee-model-harness

Minimal Elixir application that can host AI models on Fly.io GPUs and make them available via WireGuard to remote clustered Elixir applications for improved development experiences.

Harness

See the accompanying blog post Easy at-home AI with Bumblebee and Fly GPUs

This is a minimal Elixir Phoenix web application that exists only to host a pre-trained machine learning model on a Fly.io machine with an attached GPU. The purpose is to make the model and GPU accessible to a separate Elixir application clustered with this one. In this way, this app is just a harness for the following:

  • Fetching the ML model (from HuggingFace)
  • Caching the downloaded model on a volume attached to the machine
  • Hosting the model through Bumblebee in an Nx.Serving, making it easy for clustered Elixir applications to communicate with it (a sketch follows this list)
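
To make that last point concrete, here is a minimal sketch of what a Bumblebee-backed serving function can look like. This is illustrative, not the repo's exact code: the module name, the Hugging Face repo, and the compile options are assumptions.

defmodule Llama2ChatSketch do
  def serving do
    # Gated repos like Llama 2 also require an auth_token in this tuple.
    repo = {:hf, "meta-llama/Llama-2-7b-chat-hf"}

    {:ok, model_info} = Bumblebee.load_model(repo, backend: EXLA.Backend)
    {:ok, tokenizer} = Bumblebee.load_tokenizer(repo)
    {:ok, generation_config} = Bumblebee.load_generation_config(repo)

    # Returns an %Nx.Serving{} struct that can be started under a supervisor.
    Bumblebee.Text.generation(model_info, tokenizer, generation_config,
      compile: [batch_size: 1, sequence_length: 1024],
      defn_options: [compiler: EXLA]
    )
  end
end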

What's the advantage?

For a more detailed look at the advantages of doing this, please refer to this article.

In short, it's for the following reasons:

  • keep the benefits of rapid, local development
  • get access to large models, machines and GPUs
  • shut down machines when not in use
  • develop customized ML/AI code without compromising on dev tooling or development speed
  • bragging rights

Deploy this for yourself
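
Assuming you have flyctl installed and have created the app and volume on your account (e.g. via fly launch), deploying is the standard Fly.io command:

fly deploy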

This builds the Dockerfile image, deploys it, and starts the selected serving. For me, downloading and starting the serving for a new Llama 2 model took about 4 minutes.

Track the logs if you like:

fly logs

Optional updates:

The fly.toml file has the auto_stop_machines = false setting. This is helpful when getting started so the machine doesn't get shut down while the model is being downloaded. Once the machine is set up, feel free to change this value if that works better for your needs.

The VM size setting is size = "a100-40gb". This ensures the machine we get has an NVIDIA A100 GPU.
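
Together, the relevant pieces of fly.toml look roughly like this sketch (the app name is a placeholder, and your generated file may organize the sections differently):

app = "bumblebee-model-harness"

[http_service]
  auto_stop_machines = false

[[vm]]
  size = "a100-40gb"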

Selecting a ready-to-go model

Three LLMs are built-in and ready to go. Select the model to serve and enable it. Depending on the available hardware and the size of the model, hosting multiple models on the same GPU may not be practical or possible.

Select a single model to enable, deploy the harness application, and develop against it.

Models:

To select a model, uncomment it in lib/harness/application.ex and comment out the unused ones. This selects which serving to create and start. The following is an example of serving the Llama 2 model.

{Harness.DelayedServing,
  serving_name: Llama2ChatModel,
  serving_fn: fn -> Harness.Llama2Chat.serving() end},

In this example, the serving_name of Llama2ChatModel is the name of the serving to address in the client application. Name it whatever you like! It is the name the client uses when calling the serving. In the client, it looks like this:

Nx.Serving.batched_run(Llama2ChatModel, "Say hello.")
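
For a text-generation serving built with Bumblebee.Text.generation (with streaming disabled), the call returns a map of results, so a fuller client-side call might look like this sketch:

%{results: [%{text: text}]} = Nx.Serving.batched_run(Llama2ChatModel, "Say hello.")
IO.puts(text)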

The harness application uses a DelayedServing helper to start the model. Downloading and loading a large model takes time. It happens in the application startup process, which, if done synchronously, makes the application unresponsive to health checks. Fly will kill the app, thinking it's unresponsive... which it is.

The DelayedServing makes the loading asynchronous so the application starts quickly and is responsive while the larger loading task continues.
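
The idea is roughly the following sketch (not the actual implementation; Harness.ServingSupervisor here is a hypothetical DynamicSupervisor): return from init/1 immediately so supervision and health checks pass, then build and start the serving in a handle_continue callback.

defmodule DelayedServingSketch do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(opts) do
    # Return right away so application startup is not blocked.
    {:ok, opts, {:continue, :load_serving}}
  end

  @impl true
  def handle_continue(:load_serving, opts) do
    # Slow part: download the model and build the %Nx.Serving{} struct.
    serving = Keyword.fetch!(opts, :serving_fn).()

    # Start the serving under a dynamic supervisor so clients can
    # address it by name with Nx.Serving.batched_run/2.
    {:ok, _pid} =
      DynamicSupervisor.start_child(
        Harness.ServingSupervisor,
        {Nx.Serving, serving: serving, name: Keyword.fetch!(opts, :serving_name)}
      )

    {:noreply, opts}
  end
end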

Troubleshooting and diagnosis tips

To test and verify that you've successfully deployed the application to a machine with GPU access and that your application has all the necessary support for taking advantage of the GPU, do the following:

$ fly ssh console
# bin/harness remote
iex> Harness.DelayedServing.has_gpu_access?()
true

If you get a true response, then the machine and your Elixir application both have access to the GPU.

Additionally, the harness app's logs report whether the Elixir application successfully gained GPU access.

Logged messages:

  • info - "Elixir has CUDA GPU access! Starting serving #{serving_name}."
  • warning - "Elixir does not have GPU access. Serving will NOT be started."

Waiting for first-time model downloads

Depending on the model being used, it may be many GB in size and take several minutes to download.

If the Fly.io volume is set up correctly and available to the machine, the files are downloaded to /data/cache/bumblebee/huggingface/.

Before the Nx serving can be activated, the model must be fully downloaded, loaded into RAM, and then moved to the GPU. Once complete, the serving is available to call.

The attached volume caches the download so the local files are used the next time the harness application is started, skipping the lengthy download step.
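
Bumblebee picks its cache location from the BUMBLEBEE_CACHE_DIR environment variable, so the wiring in fly.toml is roughly this sketch (the volume name is a placeholder):

[env]
  BUMBLEBEE_CACHE_DIR = "/data/cache/bumblebee"

[mounts]
  source = "data"
  destination = "/data"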

Clustering your local app to the harness app on Fly.io

This documentation walks through the process: Easy Clustering from Home to Fly.io
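
The gist is that your local node and the harness node share an Erlang cookie and connect over Fly's WireGuard-based private IPv6 network; see the post for the exact distribution setup. Once clustered, using the remote GPU model is just the normal call (the node name and IPv6 address below are placeholders):

iex> Node.list()
[:"harness@fdaa:0:1234:a7b:5f:0:0:2"]
iex> Nx.Serving.batched_run(Llama2ChatModel, "Say hello.")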