Toolio
Process pool for multiple loaded LLMs, and a queuing system for the FastAPI/uvicorn workers
In supporting concurrent requests, we won't at first assume concurrent inference capability at the model-weights level, which means we'll have to mutex access to each loaded LLM. We'll want control over LLM supervision anyway, in which case we might as well support multiple LLM types hosted at once.
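For illustration, a minimal sketch (not the actual Toolio code) of what that per-instance mutexing could look like: an asyncio.Lock per loaded model, so concurrent FastAPI requests hitting the same instance queue up instead of interleaving inference on the same weights. The ManagedLLM class name and generate signature are assumptions for the sketch.

# Sketch only: one lock per loaded model instance; requests queue on the lock.
import asyncio

class ManagedLLM:
    def __init__(self, nickname, model_path):
        self.nickname = nickname
        self.model_path = model_path
        self.lock = asyncio.Lock()  # serializes requests to this instance
        self.model = None  # stand-in for the actual loaded MLX weights

    async def generate(self, prompt: str) -> str:
        async with self.lock:  # concurrent callers wait here, FIFO
            # Real inference would run here, ideally via an executor or
            # worker process so the event loop isn't blocked; placeholder only.
            await asyncio.sleep(0)
            return f'[{self.nickname}] response to: {prompt}'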
Will probably require some sort of config file, e.g. TOML. For example, to mount two instances of Meta Llama 3.1 and one of Mistral Nemo:
[llm]
# Nickname to HF or local path
llama3-8B-8bit-1 = "mlx-community/Meta-Llama-3.1-8B-Instruct-8bit"
llama3-8B-8bit-2 = "mlx-community/Meta-Llama-3.1-8B-Instruct-8bit"
mistral-nemo-8bit = "mlx-community/Mistral-Nemo-Instruct-2407-8bit"
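A minimal sketch of reading such a config and mounting each entry, assuming Python 3.11+ for the standard-library tomllib. The config file name and the load_model helper are hypothetical stand-ins for whatever actually loads the MLX weights.

# Sketch of loading the [llm] table above.
import tomllib

def load_model(hf_or_local_path):
    ...  # placeholder for the real model loader (e.g. an mlx_lm-style load)

def mount_models(config_path='toolio.toml'):  # hypothetical file name
    with open(config_path, 'rb') as fp:
        config = tomllib.load(fp)
    mounted = {}
    for nickname, hf_or_local_path in config.get('llm', {}).items():
        mounted[nickname] = load_model(hf_or_local_path)
    return mounted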
Each instance would be resident in memory, so available RAM imposes a natural limit on how many can be mounted. The client would request a model by type, and the model supervisor would dispatch to the first available instance.
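Building on the ManagedLLM sketch above, a possible shape for that dispatch logic: group mounted instances by model path, prefer one whose lock is currently free, otherwise queue on one. The ModelSupervisor name and the simple locked() check are assumptions, not Toolio's actual implementation.

# Sketch of first-available dispatch across instances of the same model type.
import asyncio

class ModelSupervisor:
    def __init__(self, instances):
        # instances: dict of nickname -> ManagedLLM (see earlier sketch)
        self.instances = instances

    def _by_type(self, model_path):
        return [i for i in self.instances.values() if i.model_path == model_path]

    async def dispatch(self, model_path, prompt):
        candidates = self._by_type(model_path)
        if not candidates:
            raise KeyError(f'No mounted instance of {model_path}')
        # Prefer an instance that isn't currently busy (best-effort check;
        # a production version would need a race-free reservation scheme).
        for inst in candidates:
            if not inst.lock.locked():
                return await inst.generate(prompt)
        # All busy: queue on the first one
        return await candidates[0].generate(prompt)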
Of interest, to understand the expected deployment scenarios: