Serverless inferencing, basic chatbot style
Is your feature request related to a problem? Please describe.
Serverless inferencing is already possible with llama.cpp via llama-cli; for example, one can do something like this:
```
llama-cli -m /home/curtine/.local/share/ramalama/models/ollama/granite-code:latest --log-disable --in-prefix --in-suffix --no-display-prompt -p "You are a helpful assistant" -c 2048 -cnv
```
This is useful because there are no idle processes left around, no single point of failure, and no blocked port; each process can target a separate GPU/CPU, and no root access is needed to restart a daemon.
Describe the solution you'd like
Serverless inferencing support: something similar to llama-cli, but integrated into llama-cpp-python.
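For illustration only, here is a minimal sketch of the kind of flow this request is after, approximated with the existing high-level Python API (the model path is a placeholder, and the loop is not a proposed interface): one short-lived process per conversation, no persistent server, no open port.

```python
from llama_cpp import Llama

# Placeholder model path; point this at any local GGUF model.
llm = Llama(
    model_path="/path/to/granite-code.gguf",
    n_ctx=2048,
    verbose=False,
)

# Keep the conversation in-process; nothing is listening on a port.
messages = [{"role": "system", "content": "You are a helpful assistant"}]

while True:
    try:
        user_input = input("> ")
    except EOFError:
        break
    messages.append({"role": "user", "content": user_input})
    response = llm.create_chat_completion(messages=messages)
    reply = response["choices"][0]["message"]["content"]
    print(reply)
    messages.append({"role": "assistant", "content": reply})
```

The ask is essentially for this interactive, process-per-conversation behavior to be available out of the box, the way llama-cli already provides it.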
Describe alternatives you've considered
Additional context