Serverless inferencing, basic chatbot style
Is your feature request related to a problem? Please describe.
Serverless inferencing is already possible with llama.cpp via llama-cli; for example, one can do something like this:
```
llama-cli -m /home/curtine/.local/share/ramalama/models/ollama/granite-code:latest --log-disable --in-prefix --in-suffix --no-display-prompt -p "You are a helpful assistant" -c 2048 -cnv
```
This is useful because there are no idle processes left around, no single point of failure, and no blocked port; each process can target a separate GPU/CPU, and no root access is needed to restart a daemon.
Describe the solution you'd like
Serverless inferencing support: something similar to llama-cli, but integrated into llama-cpp-python.
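For illustration only, here is a minimal sketch of the kind of flow this request is after, approximated with the existing high-level Python API (the model path is a placeholder, and the loop is not a proposed interface): one short-lived process per conversation, no persistent server, no open port.

```python
from llama_cpp import Llama

# Placeholder model path; point this at any local GGUF model.
llm = Llama(
    model_path="/path/to/granite-code.gguf",
    n_ctx=2048,
    verbose=False,
)

# Keep the conversation in-process; nothing is listening on a port.
messages = [{"role": "system", "content": "You are a helpful assistant"}]

while True:
    try:
        user_input = input("> ")
    except EOFError:
        break
    messages.append({"role": "user", "content": user_input})
    response = llm.create_chat_completion(messages=messages)
    reply = response["choices"][0]["message"]["content"]
    print(reply)
    messages.append({"role": "assistant", "content": reply})
```

The ask is essentially for this interactive, process-per-conversation behavior to be available out of the box, the way llama-cli already provides it.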
Describe alternatives you've considered
Additional context