# Discussion: Nitro to automagically handle Model Operations?
## Objective
- I would like to float a thought balloon: Nitro could expand its feature set to automagically handle model loading/unloading ops and abstract them from the dev
- The rise of multi-modal AI makes model loading/unloading and VRAM/RAM management important
- This represents an opportunity for us to grow Nitro's value-add layer, vs. just being a "server.cpp"
## Reasoning
- Developers new to local AI are used to OpenAI, where they don't have to think about model loading/unloading
  - Devs will just call `chat/completions` with different model names
  - Devs will just call `chat/completions` with images, sound, files, etc.
- As we progress towards multi-modal AI on resource-constrained devices, "model ops" becomes a real concern
  - Hear, speak, think all require different models (even if we move to LLaVA-style models)
  - Memory management, especially on constrained devices, is a field ripe for optimization
- Jan will primarily be a VSCode-tier product (i.e. application layer)
- Nitro & Jan should have clean abstractions driven by architectural decisions
- I see Jan's roadmap as being driven largely by Assistants, while leaving local model management to Nitro
- Imho, this functionality should live with Nitro
## Proposal
### Nitro to be "model aware"
- A Nitro server would be "aware" of what models it has access to
- Naturally, this would be via a `/models` folder (aligning with Jan's folder convention has significant business benefits for us)
- This would allow users to pass in a `model` param in `chat/completions`, similar to OpenAI (see the Example below)
### Nitro to abstract model loading/unloading from the user
- Devs would just send `chat/completions` requests to Nitro
- Nitro will handle loading/unloading (we just assume requests can always be queued in-memory)
- Nitro *may* send "statuses" via SSE to the requester, to indicate model loading lag (Dan's question: is this a good idea? See the sketch below)
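As a sketch of what those SSE statuses could look like, the event names and JSON fields below are my own assumptions, not an existing Nitro API:

```sh
# Hypothetical SSE stream for a cold-start request (event/field names are illustrative)
curl -N http://localhost:3928/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mistral-7b", "messages": [{"role": "user", "content": "Hello"}]}'

# event: model.status
# data: {"model": "mistral-7b", "state": "loading", "progress": 0.4}
#
# event: model.status
# data: {"model": "mistral-7b", "state": "ready"}
#
# data: {"choices": [{"delta": {"content": "Hi"}}]}   <- normal completion chunks follow
```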
### Nitro to have a `/models` endpoint
We would refactor the current Nitro model endpoints:
| Status Quo | Recommended Change | Function |
|---|---|---|
| `/modelstatus` | `GET v1/models` | Returns state of models in Nitro |
| `/loadmodel` | `POST v1/models/load { ...params }` | Loads a model |
| `/unloadmodel` | `POST v1/models/unload { ...params }` | Unloads a model |
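For illustration, the refactored endpoints could be exercised like this; the request bodies are assumptions loosely modeled on Nitro's existing load params, not a settled schema:

```sh
# Return state of models in Nitro
curl http://localhost:3928/v1/models

# Load a model with explicit params (body fields are illustrative)
curl -X POST http://localhost:3928/v1/models/load \
  -H "Content-Type: application/json" \
  -d '{"model": "mistral-7b", "ctx_len": 4096, "ngl": 32}'

# Unload a model
curl -X POST http://localhost:3928/v1/models/unload \
  -H "Content-Type: application/json" \
  -d '{"model": "mistral-7b"}'
```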
### Nitro to be aware of available RAM/VRAM
- Debatable: Nitro should be aware of available RAM/VRAM
- Nitro should figure out when to unload models (there are a lot of good techniques to borrow from OS memory-management principles; one heuristic is sketched below)
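One possible heuristic, purely as a sketch: before loading, compare the model file's size against free memory headroom, and evict the least-recently-used model if it doesn't fit. The probes below are examples of existing tooling Nitro could mirror (NVIDIA/Linux only; other platforms would need their own calls):

```sh
# Free VRAM in MiB per GPU (NVIDIA)
nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits

# Available system RAM in MiB (Linux procps)
free -m | awk '/^Mem:/ {print $7}'
```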
## Open Questions
Q: What happens if a user wants to load a model with specific params?

A: Nitro still retains a `POST v1/models/load` endpoint for this purpose, and subsequent calls to that model will use those params (unless it is unloaded).

A: In my opinion, the most dev-friendly approach is to let them define a `model.json` with load configs (sketched below); I don't see most developers needing to change it frequently.
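For instance, a `model.json` might look like the following; the field names are assumptions loosely borrowed from Nitro's current load params, not a settled schema:

```sh
# Hypothetical model.json with default load params for { model-id } = mistral-7b
cat > ~/models/mistral-7b/model.json <<'EOF'
{
  "llama_model_path": "mistral-7b-gguf.q7.bin",
  "ctx_len": 4096,
  "ngl": 32,
  "embedding": false
}
EOF
```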
## Example
Developer populates the models folder:
```
# Folder structure
/models
  /mistral-7b                    # { model-id }
    mistral-7b-gguf.q7.bin
    model.json                   # default params
  neuralchat-7b-ggufv3.q8.bin
```
Developer starts Nitro server:
```sh
# Nitro server
nitro --models ~/models
```
Developer proceeds to use Nitro:
[User]

```sh
# Request 1 to Mistral
curl http://localhost:3928/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral-7b",
    ...
  }'
```
[Nitro] loads Mistral, and sends the `chat/completions` response back
[User]

```sh
# Request 2 to Neural Chat (a bin file)
curl http://localhost:3928/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "neuralchat-7b-ggufv3.q8.bin",
    ...
  }'
```
[Nitro] sees that it has enough RAM to hold both Mistral and Neuralchat, and proceeds to load Neuralchat
[User]

```sh
# Request 3 to Codellama, by passing in a path to the model
curl http://localhost:3928/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/path/to/codellama-34b-gguf.q4_k_m.bin",  # Model not in folder
    ...
  }'
```
[Nitro] sees that Codellama is fairly large, and proceeds to unload Mistral but not Neuralchat