
Discussion: Nitro to automagically handle Model Operations?

Open dan-jan opened this issue 8 months ago • 1 comment

Objective

  • I would like to float a thought balloon for Nitro to expand its feature set: automagically handle model loading/unloading ops and abstract them from the Dev
  • The rise of multi-modal AI makes loading/unloading, VRAM and RAM management important
  • This represents an opportunity for us to grow Nitro's value add layer, vs. just being a "server.cpp"

Reasoning

  • Developers new to local AI are used to OpenAI, where they don't have to think about model loading/unloading
    • Devs will just call chat/completions with different model names
    • Devs will just call chat/completions with images, sound, files etc
  • As we progress towards multi-modal AI on resource-constrained devices, "model ops" is a concern
    • Hearing, speaking, and thinking all require different models (even if we move to LLaVA-style models)
    • Memory management, especially on constrained devices, is a field ripe for optimization
  • Jan will primarily be a VSCode-tier product (i.e. application layer)
    • Nitro & Jan should have clean abstractions driven by architectural decisions
    • I see Jan's roadmap as being very driven by Assistants, while leaving local model management to Nitro
    • Imho, this functionality should live with Nitro

Proposal

Nitro be "model aware"

  • A Nitro server would be "aware" of what models it has access to
  • Naturally, this would be via a /models folder (aligning with Jan has significant business benefits for us)
  • This would allow users to pass in model params in chat/completions, similar to OpenAI
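To make "model aware" concrete, here is a minimal sketch (Python, purely illustrative; `scan_models` and the registry shape are hypothetical, not Nitro's actual code) of how a server could index a /models folder on startup, treating a subdirectory as a model ID with an optional model.json of default params, and a bare .bin file as its own ID:

```python
import json
import os
import tempfile

def scan_models(root):
    """Map model IDs to file paths: a subdirectory is a model ID with an
    optional model.json of default params; a bare .bin file is its own ID."""
    registry = {}
    for entry in sorted(os.listdir(root)):
        path = os.path.join(root, entry)
        if os.path.isdir(path):
            bins = [f for f in os.listdir(path) if f.endswith(".bin")]
            params = {}
            params_file = os.path.join(path, "model.json")
            if os.path.exists(params_file):
                with open(params_file) as f:
                    params = json.load(f)
            if bins:
                registry[entry] = {"path": os.path.join(path, bins[0]),
                                   "params": params}
        elif entry.endswith(".bin"):
            registry[entry] = {"path": path, "params": {}}
    return registry

# Demo on a throwaway folder mirroring the example layout from this issue
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "mistral-7b"))
open(os.path.join(root, "mistral-7b", "mistral-7b-gguf.q7.bin"), "w").close()
with open(os.path.join(root, "mistral-7b", "model.json"), "w") as f:
    json.dump({"ctx_len": 2048}, f)
open(os.path.join(root, "neuralchat-7b-ggufv3.q8.bin"), "w").close()

reg = scan_models(root)
print(sorted(reg))  # ['mistral-7b', 'neuralchat-7b-ggufv3.q8.bin']
```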

Nitro to abstract model loading/unloading from user

  • Devs would just send chat/completions requests to Nitro
  • Nitro will handle loading/unloading (we just assume requests can always be queued in-memory)
  • Nitro -may- send "statuses" via SSE to requester, to indicate model loading time lag (Dan's question: is this a good idea?)
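If we did go with SSE statuses, the wire format might look something like this (event names and fields are hypothetical, not a settled schema):

```
event: status
data: {"model": "mistral-7b", "state": "loading"}

event: status
data: {"model": "mistral-7b", "state": "ready"}

data: {"choices": [...]}
```

After the "ready" event, the normal streamed chat/completions chunks would follow on the same connection.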

Nitro to have a /models endpoint

We would refactor current Nitro models endpoints:

| Status Quo | Recommended Change | Function |
| --- | --- | --- |
| /modelstatus | GET v1/models | Return state of models in Nitro |
| /loadmodel | POST v1/models/load { ...params } | Loads model |
| /unloadmodel | POST v1/models/unload { ...params } | Unloads model |
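For illustration, the refactored endpoints might be exercised like this (request-body fields are hypothetical, not a settled schema, and this assumes a Nitro server running on the default port):

```shell
# List the state of known models
curl http://localhost:3928/v1/models

# Load a model with explicit params
curl -X POST http://localhost:3928/v1/models/load \
  -H "Content-Type: application/json" \
  -d '{"model": "mistral-7b", "ctx_len": 2048}'

# Unload it again
curl -X POST http://localhost:3928/v1/models/unload \
  -H "Content-Type: application/json" \
  -d '{"model": "mistral-7b"}'
```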

Nitro to be aware of available RAM/VRAM

  • Debatable: Nitro should be aware of available RAM/VRAM
  • Nitro should figure out when to unload models (there are a lot of good techniques to borrow from OS memory-management principles, e.g. LRU eviction)
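As a toy sketch of that idea (Python, purely illustrative; the sizes and budget are made up, and a real implementation would query actual RAM/VRAM), an LRU-style unload policy could look like this:

```python
from collections import OrderedDict

class ModelCache:
    """Evict least-recently-used models until a new model fits the budget."""
    def __init__(self, budget_mb):
        self.budget = budget_mb
        self.loaded = OrderedDict()  # model_id -> size_mb, in LRU order

    def ensure_loaded(self, model_id, size_mb):
        """Return the list of model IDs unloaded to make room."""
        if model_id in self.loaded:
            self.loaded.move_to_end(model_id)  # mark as recently used
            return []
        evicted = []
        while self.loaded and sum(self.loaded.values()) + size_mb > self.budget:
            victim, _ = self.loaded.popitem(last=False)  # unload LRU model
            evicted.append(victim)
        self.loaded[model_id] = size_mb
        return evicted

# Mirrors the walkthrough below: Mistral and Neuralchat fit together;
# a large Codellama forces unloading the least recently used (Mistral).
cache = ModelCache(budget_mb=32000)
cache.ensure_loaded("mistral-7b", 8000)
cache.ensure_loaded("neuralchat-7b", 8000)
evicted = cache.ensure_loaded("codellama-34b", 20000)
print(evicted)  # ['mistral-7b']
```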

Open Questions

Q: What happens if a user wants to load a model with specific params?
A: Nitro still retains a POST models/load endpoint for this purpose, and subsequent calls to that model will use those params (unless it is unloaded).
A: In my opinion, the most Dev-friendly approach is to let them define a model.json with load configs; I don't see most developers needing to change it frequently.
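As a sketch, such a model.json might look like the following (field names are chosen for illustration and are not a settled schema):

```json
{
  "ctx_len": 2048,
  "ngl": 32,
  "embedding": false
}
```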

Example

Developer populates Model folder:

# Folder structure
/models
    /mistral-7b                         # { model-id }
        mistral-7b-gguf.q7.bin
        model.json                      # default params
    neuralchat-7b-ggufv3.q8.bin         # bare model file; ID is the filename

Developer starts Nitro server:

# Nitro server
nitro --models ~/models

Developer proceeds to use Nitro

[User]
# Request 1 to Mistral
curl http://localhost:3928/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral-7b",
    ...
  }'
[Nitro] loads Mistral, and sends chat/completions back
[User]
# Request 2 to Neural Chat (a bin file)
curl http://localhost:3928/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "neuralchat-7b-ggufv3.q8.bin",
    ...
  }'
[Nitro] sees that it has enough RAM to hold both Mistral and Neuralchat, and proceeds to load Neuralchat
[User]
# Request 3 to Codellama, by passing in a path to the model
curl http://localhost:3928/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/path/to/codellama-34b-gguf.q4_k_m.bin",   # model not in folder
    ...
  }'
[Nitro] sees that Codellama is fairly large, and proceeds to unload Mistral but not Neuralchat

dan-jan avatar Nov 22 '23 13:11 dan-jan