
server : improvements and maintenance

ggerganov opened this issue on Nov 25, 2023 • 103 comments

The server example has been growing in functionality and, unfortunately, I feel it is not very stable at the moment, and some important features are still missing. I am creating this issue to keep track of some of these points and to try to draw more attention from the community. Some of the tasks are relatively big and would require significant effort to complete.

  • [x] Support chat templates: We need a separation between the user input and the special tokens, so that tokenization is performed correctly. See the following comments / commits for more context: https://github.com/ggerganov/llama.cpp/pull/4160#discussion_r1403675264 https://github.com/ggerganov/llama.cpp/pull/4198/commits/c544faed749240fe5eac2bc042087c71f79a0728 https://github.com/ggerganov/llama.cpp/pull/4160#issuecomment-1824984718

    We already support extracting meta information from the GGUF model files that can provide the chat template for the specific model: https://github.com/ggerganov/llama.cpp/pull/4125. Support for chat templates in /v1/chat/completions: https://github.com/ggerganov/llama.cpp/pull/5593. List of supported templates: view on wiki.

    Supporting this in the server would require changes in both the backend and the frontend; a rough sketch of the idea follows below.
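    To illustrate the separation mentioned above, here is a minimal, self-contained sketch (not the actual server code; the `segment` struct and the `apply_chatml` helper are made up for this example). The idea is to keep template text and user text in separate pieces, so that only the template pieces are later tokenized with special-token parsing enabled:

    ```cpp
    #include <iostream>
    #include <string>
    #include <utility>
    #include <vector>

    // A piece of the final prompt. Template pieces may contain special tokens
    // (e.g. <|im_start|>); user pieces must be tokenized as plain text so that
    // user input can never inject special tokens.
    struct segment {
        std::string text;
        bool is_template;
    };

    // Hypothetical helper: apply a ChatML-style template to (role, content) pairs.
    static std::vector<segment> apply_chatml(const std::vector<std::pair<std::string, std::string>> & messages) {
        std::vector<segment> segs;
        for (const auto & m : messages) {
            segs.push_back({"<|im_start|>" + m.first + "\n", true});
            segs.push_back({m.second, false});
            segs.push_back({"<|im_end|>\n", true});
        }
        segs.push_back({"<|im_start|>assistant\n", true});
        return segs;
    }

    int main() {
        const auto segs = apply_chatml({
            {"system", "You are a helpful assistant."},
            {"user",   "Hello <|im_end|> world"}, // must stay plain text, not a special token
        });
        for (const auto & s : segs) {
            std::cout << (s.is_template ? "[template] " : "[user]     ") << s.text;
        }
        return 0;
    }
    ```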

  • [x] Likely redundant logic for OpenAI (OAI) compatibility that should be removed https://github.com/ggerganov/llama.cpp/pull/4198#discussion_r1404500731

  • [x] Use multiple mount points for the OAI API https://github.com/ggerganov/llama.cpp/blob/af19d3573481d409b3c4e55494810eb1f65a9aae/examples/server/server.cpp#L2682-L2684 https://github.com/ggerganov/llama.cpp/pull/5722
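    A minimal sketch of the idea, assuming cpp-httplib (which the server example already uses) and a hypothetical `handle_chat_completions` handler: the same handler is simply registered under both the legacy and the /v1 mount points.

    ```cpp
    #include "httplib.h" // cpp-httplib, already bundled with the server example

    int main() {
        httplib::Server svr;

        // Hypothetical shared handler for the OAI-compatible chat endpoint.
        const auto handle_chat_completions = [](const httplib::Request & req, httplib::Response & res) {
            (void) req;
            // ... parse the request, run the completion, fill `res` (omitted) ...
            res.set_content("{}", "application/json");
        };

        // Register the same handler under multiple mount points.
        svr.Post("/chat/completions",    handle_chat_completions);
        svr.Post("/v1/chat/completions", handle_chat_completions);

        svr.listen("127.0.0.1", 8080);
        return 0;
    }
    ```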

  • [x] Return meaningful errors on KV cache overflow https://github.com/ggerganov/llama.cpp/issues/4185#issuecomment-1825721736
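    A minimal sketch of what a meaningful error could look like (the field names below are made up, not the server's actual error format): reject a request whose prompt plus requested completion cannot fit in the context window, instead of truncating or failing silently.

    ```cpp
    #include <nlohmann/json.hpp> // already used by the server example
    #include <iostream>

    using json = nlohmann::json;

    // Return a structured error if the request cannot fit in the KV cache / context.
    static json check_context_fit(int n_prompt_tokens, int n_predict, int n_ctx) {
        if (n_prompt_tokens + n_predict > n_ctx) {
            return {
                {"error", {
                    {"code",            "context_length_exceeded"},
                    {"message",         "the request exceeds the available context size"},
                    {"n_prompt_tokens", n_prompt_tokens},
                    {"n_predict",       n_predict},
                    {"n_ctx",           n_ctx},
                }},
            };
        }
        return json::object(); // empty object -> no error
    }

    int main() {
        std::cout << check_context_fit(4000, 128, 2048).dump(2) << std::endl;
        return 0;
    }
    ```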

  • [x] Refactor the code: With the recent additions of parallel decoding support for multiple clients and of LLaVA, I feel the code base has become very cumbersome, and there is a lot of room for refactoring and improving the code. Some effort should be dedicated to cleaning things up and simplifying the code. https://github.com/ggerganov/llama.cpp/pull/5065 https://github.com/ggerganov/llama.cpp/pull/5710

  • [x] Batched decoding endpoint? Although we added parallel decoding support via "slots", we still lack batched decoding, where a single client could pass an array of prompts to be completed or, alternatively, generate multiple completions for a single prompt. It would be useful to support this use case (see the sketch below): https://github.com/ggerganov/llama.cpp/issues/3478#issuecomment-1822010431
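    A minimal sketch of the request fan-out mentioned above (not the actual server code): accept either a single prompt string or an array of prompts in one request body and turn it into one task per prompt, so that each can be scheduled on its own slot.

    ```cpp
    #include <nlohmann/json.hpp>
    #include <iostream>
    #include <string>
    #include <vector>

    using json = nlohmann::json;

    // Accept "prompt": "..." or "prompt": ["...", "..."] and return one entry per prompt.
    static std::vector<std::string> split_prompts(const json & body) {
        std::vector<std::string> prompts;
        const json & p = body.at("prompt");
        if (p.is_array()) {
            for (const auto & item : p) {
                prompts.push_back(item.get<std::string>());
            }
        } else {
            prompts.push_back(p.get<std::string>());
        }
        return prompts;
    }

    int main() {
        const json body = json::parse(R"({"prompt": ["Hello", "Bonjour", "Hola"]})");
        for (const auto & prompt : split_prompts(body)) {
            std::cout << "queue task for: " << prompt << "\n"; // one task per prompt -> one slot each
        }
        return 0;
    }
    ```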

  • [ ] Tool calls (function calling): Support for the MeetKai/functionary model by implementing OpenAI-compatible tool calls in the chat endpoint (see the sketch of the response shape below). https://github.com/ggerganov/llama.cpp/pull/5695
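    For reference, a minimal sketch of the OpenAI-style "tool_calls" shape that the chat endpoint would need to emit (all values are made up for illustration):

    ```cpp
    #include <nlohmann/json.hpp>
    #include <iostream>

    using json = nlohmann::json;

    int main() {
        // Shape of an assistant message that requests a function call,
        // following the OpenAI chat completions API.
        const json message = {
            {"role",    "assistant"},
            {"content", nullptr},
            {"tool_calls", json::array({
                {
                    {"id",   "call_0"},
                    {"type", "function"},
                    {"function", {
                        {"name",      "get_weather"},
                        {"arguments", "{\"location\": \"Sofia\"}"},
                    }},
                },
            })},
        };
        std::cout << message.dump(2) << std::endl;
        return 0;
    }
    ```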

  • [ ] Multimodal support: Support was temporarily dropped in #5882; before working on the server, we should improve llava-cli and the API for using LLaVA:

    • #6027
    • https://github.com/ggerganov/llama.cpp/pull/5882#issuecomment-1980713874
    • https://github.com/ggerganov/llama.cpp/pull/5882#issuecomment-1991583459
    • #5896
    • #5592
    • #6226
  • [ ] Prompt processing improvements

    • #6586
    • #6607
  • [ ] Server production readiness

    • https://github.com/ggerganov/llama.cpp/discussions/6398
    • #6546

This is likely not a complete list - if there is a feature that you think should be improved or supported, drop a comment.

Have a look at the issues labelled with server/webui.
