refactor(ml): ray serve
This PR integrates Ray Serve into the ML server.
Motivation:
- Optimized memory consumption.
- Batching for faster inference.
- More maintainable code structure.
Features:
- Easy process and thread scheduling. Ray supports many deployment schemes, allowing developers to scale models up and down as needed without tedium.
- Following #2661, it was noted that memory usage after model unloading is still notably higher than at startup. Since Ray allows scoping models to individual processes, model unloading is now much more robust.
- Microbatching. Following #2693, server-side batching is a difficult task that requires structural changes in the server code and added complexity, especially for error handling and post-upload inference. This feature seamlessly batches multiple requests into one model batch and returns results as if they were never batched, so the server code can be left largely intact. On the ML side, the feature is a simple `@serve.batch` decorator (see the sketch after this list).
- Forward-minded. The purpose-built tools that Ray offers make future ML development simpler and more robust. For example, it supports changing model configurations on the fly from requests with a simple `reconfigure` method. It even comes with a dashboard for better monitoring, although this is not currently exposed outside of the Docker container.
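For illustration, here is a minimal sketch of the two Serve features mentioned above, assuming Ray Serve 2.x; the `CLIPEncoder` name, the `user_config` contents, and the inference body are placeholders rather than the PR's actual code:

```python
from ray import serve


@serve.deployment(user_config={"model_name": "ViT-B-32"})
class CLIPEncoder:
    def __init__(self) -> None:
        self.model_name = None

    def reconfigure(self, config: dict) -> None:
        # Called on deploy and again whenever user_config changes,
        # so the model can be swapped without restarting the process.
        self.model_name = config["model_name"]

    @serve.batch(max_batch_size=16, batch_wait_timeout_s=0.01)
    async def encode(self, images: list) -> list:
        # Serve collects up to 16 concurrent requests (or whatever arrives
        # within the 10 ms window) into `images`, and fans the returned list
        # back out to the individual callers.
        return [f"embedding for {img} from {self.model_name}" for img in images]
```

Callers invoke `encode()` with a single item; Serve transparently gathers concurrent calls into the `images` list and unbatches the results, so the calling code never sees the batching.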
Implementation:
- Models are run in a shared handler process while endpoints are in a separate ingress process.
- They cannot be the same since the model process must be able to terminate while still allowing requests to spin it up again when needed.
- An earlier implementation deployed each model as its own process. This allowed for simpler code and was optimal for inference speed since it bypasses Python's GIL. However, RAM usage was notably higher when run in this way, meaning it could cause OOM issues.
- Multiple requests are combined into a list by Serve through an async decorator. The method executes once the max batch size is reached or the timeout has elapsed, whichever comes first.
- A timeout of 10ms seems to work well for local deployment, but should be higher for remote deployments to cover variations in latency.
- A model cache in the shared model process is responsible for loading and unloading individual models.
- While the process will be automatically terminated after idling for the `MACHINE_LEARNING_MODEL_TTL` duration, it's possible for all models to be loaded when only one is actively used.
- Models are loaded in a separate thread, so already-loaded models can still respond to requests (see the sketch after this list).
- This ensures no downtime: microbatching (and asyncio in general) is time-sensitive, so blocking the event loop while loading models would be disruptive to performance.
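As a rough illustration of the shared handler process described above, here is a sketch assuming Ray Serve 2.x and Python 3.10; `ModelCache`, `load_model`, `ModelHandler`, and the default TTL value are hypothetical stand-ins, not the PR's actual identifiers:

```python
import asyncio
import os
import time

from ray import serve


def load_model(name: str):
    """Placeholder for the real (blocking) model load, e.g. creating an inference session."""
    time.sleep(1.0)  # simulate slow disk reads / weight loading
    return object()


class ModelCache:
    """Keeps loaded models in the handler process; loads run in a worker thread
    so models that are already in memory keep answering batched requests."""

    def __init__(self) -> None:
        self._models: dict[str, object] = {}
        self._locks: dict[str, asyncio.Lock] = {}
        self._last_used: dict[str, float] = {}

    async def get(self, name: str):
        lock = self._locks.setdefault(name, asyncio.Lock())
        async with lock:
            if name not in self._models:
                # Off-load the blocking load so the event loop (and therefore
                # microbatching) is never stalled.
                self._models[name] = await asyncio.to_thread(load_model, name)
        self._last_used[name] = time.monotonic()
        return self._models[name]


@serve.deployment
class ModelHandler:
    def __init__(self) -> None:
        self.cache = ModelCache()
        # Idle duration after which models are dropped / the process may terminate;
        # the default of 300 s here is illustrative only.
        self.ttl = float(os.getenv("MACHINE_LEARNING_MODEL_TTL", "300"))

    async def predict(self, model_name: str, payload):
        model = await self.cache.get(model_name)
        # ... run inference with `model` on `payload` ...
        return {"model": model_name, "result": None}
```

The key point is `asyncio.to_thread`: the blocking load happens in a worker thread, so the handler's event loop keeps serving time-sensitive batched requests for models already in memory.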
Other:
- Temporary downgrade from Python 3.11 to 3.10, as Ray only has experimental support for Python 3.11 as of version 2.5.
Fixes #3142
I appreciate the work and the initiative from you. However, for a significant change like this, the team would appreciate you discussing it with us first before putting your effort into the work. This would help us understand the anticipated change and discuss with you what we'd like the direction to be. Can you please open a focused discussion on Discord and help us understand the changes here?
Sure thing. This is mainly to address concerns raised in previous PRs, particularly #2661 and #2693. The goal of this PR is to be as seamless a change as possible.
Rebased. It's much smaller than before since a big chunk of it has already been merged.
After some optimizations for RAM usage, the numbers are roughly like this when tested with Locust:
- ~1.3 GB on startup
- ~3 GB with all models under load
- ~1.3 GB after models are unloaded
Closing this since my experience with Ray so far has been that it doesn't fully address memory issues and introduces some of its own. A lighter solution may be more appropriate.