
Serverless inferencing, basic chatbot style

ericcurtin opened this issue on Sep 23, 2024 · 0 comments

Is your feature request related to a problem? Please describe.

Serverless inferencing is possible with llama.cpp via llama-cli, for example one can do something like this:

llama-cli -m /home/curtine/.local/share/ramalama/models/ollama/granite-code:latest --log-disable --in-prefix --in-suffix --no-display-prompt -p "You are a helpful assistant" -c 2048 -cnv

This can be useful: no idle processes hanging around, no single point of failure, no blocked port, each process can target a separate GPU/CPU, no need for root access to restart a daemon, etc.

Describe the solution you'd like

Serverless inferencing support, something similar to llama-cli but integrated with llama-cpp-python.
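
For illustration only, here is a minimal sketch of what that chatbot-style, serverless workflow could look like using llama-cpp-python's existing high-level API (not the requested feature itself): the process loads the model, chats on stdin/stdout, and exits, with no server or daemon involved. The model path is hypothetical; point it at any local GGUF file.

```python
from llama_cpp import Llama

llm = Llama(
    model_path="/path/to/granite-code.gguf",  # hypothetical local GGUF model
    n_ctx=2048,                                # mirrors -c 2048 from llama-cli
    verbose=False,                             # mirrors --log-disable
)

# Seed the conversation with a system prompt, as with -p in llama-cli.
messages = [{"role": "system", "content": "You are a helpful assistant"}]

while True:
    try:
        user = input("> ")
    except EOFError:
        break  # exit cleanly on Ctrl-D; nothing is left running afterwards
    messages.append({"role": "user", "content": user})
    reply = llm.create_chat_completion(messages=messages, max_tokens=512)
    text = reply["choices"][0]["message"]["content"]
    print(text)
    messages.append({"role": "assistant", "content": text})
```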

Describe alternatives you've considered

A clear and concise description of any alternative solutions or features you've considered.

Additional context

Add any other context or screenshots about the feature request here.

ericcurtin · Sep 23, 2024