refactor(ml): ray serve
This PR integrates Ray Serve into the ML server.
Motivation:
- Optimized memory consumption.
- Batching for faster inference.
- More maintainable code structure.
Features:
- Easy process and thread scheduling. Ray supports many deployment schemes, allowing developers to scale models up and down as needed without tedium.
- Following #2661, it was noted that memory usage after model unloading is still notably higher than at startup. Since Ray allows scoping models to individual processes, model unloading is now much more robust.
- Microbatching. Following #2693, server-side batching is a difficult task that requires structural changes in the server code and added complexity, especially for error handling and post-upload inference. This feature seamlessly batches multiple requests into one model batch and returns results as if they were never batched, so the server code can be left largely intact. On the ML side, the feature is a simple `@serve.batch` decorator (see the sketch after this list).
- Forward-minded. The purpose-built tools that Ray offers make future ML development simpler and more robust. For example, it supports changing model configurations on the fly from requests with a simple `reconfigure` method. It even comes with a dashboard for better monitoring, although this is not currently exposed outside of the Docker container.
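For illustration, here is a minimal sketch of the two Serve features mentioned above, assuming Ray Serve 2.x; the `CLIPEncoder` name, the `user_config` contents, and the inference body are placeholders rather than the PR's actual code:

```python
from ray import serve


@serve.deployment(user_config={"model_name": "ViT-B-32"})
class CLIPEncoder:
    def __init__(self) -> None:
        self.model_name = None

    def reconfigure(self, config: dict) -> None:
        # Called on deploy and again whenever user_config changes,
        # so the model can be swapped without restarting the process.
        self.model_name = config["model_name"]

    @serve.batch(max_batch_size=16, batch_wait_timeout_s=0.01)
    async def encode(self, images: list) -> list:
        # Serve collects up to 16 concurrent requests (or whatever arrives
        # within the 10 ms window) into `images`, and fans the returned list
        # back out to the individual callers.
        return [f"embedding for {img} from {self.model_name}" for img in images]
```

Callers invoke `encode()` with a single item; Serve transparently gathers concurrent calls into the `images` list and unbatches the results, so the calling code never sees the batching.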
Implementation:
- Models are run in a shared handler process while endpoints are in a separate ingress process.
- They cannot be the same since the model process must be able to terminate while still allowing requests to spin it up again when needed.
- An earlier implementation deployed each model as its own process. This allowed for simpler code and was optimal for inference speed since it bypasses Python's GIL. However, RAM usage was notably higher when run in this way, meaning it could cause OOM issues.
- Multiple requests are combined into a list by Serve through an async decorator. The method executes once the max batch size is reached or the timeout has elapsed, whichever comes first.
- A timeout of 10ms seems to work well for local deployment, but should be higher for remote deployments to cover variations in latency.
- A model cache in the shared model process is responsible for loading and unloading individual models.
- While the process will be automatically terminated after idling for the `MACHINE_LEARNING_MODEL_TTL` duration, it's possible for all models to be loaded when only one is actively used.
- Models are loaded in a separate thread, so already-loaded models can still respond to requests (see the sketch after this list).
- This ensures no downtime: microbatching (and asyncio in general) is time-sensitive, so blocking the event loop while loading models would be disruptive to performance.
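As a rough illustration of the shared handler process described above, here is a sketch assuming Ray Serve 2.x and Python 3.10; `ModelCache`, `load_model`, `ModelHandler`, and the default TTL value are hypothetical stand-ins, not the PR's actual identifiers:

```python
import asyncio
import os
import time

from ray import serve


def load_model(name: str):
    """Placeholder for the real (blocking) model load, e.g. creating an inference session."""
    time.sleep(1.0)  # simulate slow disk reads / weight loading
    return object()


class ModelCache:
    """Keeps loaded models in the handler process; loads run in a worker thread
    so models that are already in memory keep answering batched requests."""

    def __init__(self) -> None:
        self._models: dict[str, object] = {}
        self._locks: dict[str, asyncio.Lock] = {}
        self._last_used: dict[str, float] = {}

    async def get(self, name: str):
        lock = self._locks.setdefault(name, asyncio.Lock())
        async with lock:
            if name not in self._models:
                # Off-load the blocking load so the event loop (and therefore
                # microbatching) is never stalled.
                self._models[name] = await asyncio.to_thread(load_model, name)
        self._last_used[name] = time.monotonic()
        return self._models[name]


@serve.deployment
class ModelHandler:
    def __init__(self) -> None:
        self.cache = ModelCache()
        # Idle duration after which models are dropped / the process may terminate;
        # the default of 300 s here is illustrative only.
        self.ttl = float(os.getenv("MACHINE_LEARNING_MODEL_TTL", "300"))

    async def predict(self, model_name: str, payload):
        model = await self.cache.get(model_name)
        # ... run inference with `model` on `payload` ...
        return {"model": model_name, "result": None}
```

The key point is `asyncio.to_thread`: the blocking load happens in a worker thread, so the handler's event loop keeps serving time-sensitive batched requests for models already in memory.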
Other:
- Temporary downgrade from Python 3.11 to 3.10, as Ray only has experimental support for Python 3.11 as of version 2.5.
Fixes #3142
I appreciate the work and the initiative from you. However, for a significant change like this, the team would appreciate you discussing it with us first before putting your effort into the work. This would help us understand the anticipated change and discuss with you what we'd like the direction to be. Can you please open a focused discussion on Discord and help us understand the changes here?
Sure thing. This is mainly to address concerns raised in previous PRs, particularly #2661 and #2693. The goal of this PR is to be as seamless a change as possible.
Rebased. It's much smaller than before since a big chunk of it has already been merged.
After some optimizations for RAM usage, the numbers are roughly like this when tested with Locust:
- ~1.3 GB on startup
- ~3 GB with all models under load
- ~1.3 GB after models are unloaded
Closing this since my experience with Ray so far has been that it doesn't fully address memory issues and introduces some of its own. A lighter solution may be more appropriate.