Use `llama_chat_apply_template` in `main` (WIP)

Resolve #6391

The core idea is to use `llama_chat_apply_template` to apply the chat template twice: once without the last user message and once with it. We then take the diff between the two output strings and feed only that part into inference.

Example:

```
<start_of_turn>user
You are a helpful assistant

Hello<end_of_turn>
<start_of_turn>model
Hi there<end_of_turn>
<start_of_turn>user
Who are you<end_of_turn>
<start_of_turn>model
I am an assistant<end_of_turn>
<start_of_turn>user
Another question<end_of_turn>
<start_of_turn>model
```

Result of `chat_get_added_part()`:

```
<start_of_turn>user
Another question<end_of_turn>
<start_of_turn>model
```
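
For illustration, here is a minimal sketch of what `chat_get_added_part` could look like, assuming the current `llama_chat_apply_template(model, tmpl, chat, n_msg, add_ass, buf, length)` signature from `llama.h` (passing `tmpl == nullptr` selects the template embedded in the model); the exact buffer handling and error paths are up to the final implementation:

```cpp
#include <string>
#include <vector>

#include "llama.h"

// format a conversation with the model's built-in chat template
// (resize-and-retry when the initial buffer is too small)
static std::string apply_template(const llama_model * model,
                                  const std::vector<llama_chat_message> & chat,
                                  bool add_ass) {
    std::vector<char> buf(4096);
    int32_t res = llama_chat_apply_template(model, nullptr, chat.data(), chat.size(),
                                            add_ass, buf.data(), buf.size());
    if (res > (int32_t) buf.size()) {
        buf.resize(res);
        res = llama_chat_apply_template(model, nullptr, chat.data(), chat.size(),
                                        add_ass, buf.data(), buf.size());
    }
    return res < 0 ? std::string() : std::string(buf.data(), res);
}

// sketch of chat_get_added_part(): format the history with and without the last
// user message and return only the newly appended suffix
// (assumes `chat` contains at least the new user message)
static std::string chat_get_added_part(const llama_model * model,
                                       const std::vector<llama_chat_message> & chat) {
    std::vector<llama_chat_message> prev(chat.begin(), chat.end() - 1);
    const std::string fmt_prev = apply_template(model, prev, false); // without last msg
    const std::string fmt_curr = apply_template(model, chat,  true); // with last msg + assistant prefix
    // edge case: if the shorter formatting is not a strict prefix, fall back to the full prompt
    if (fmt_curr.compare(0, fmt_prev.size(), fmt_prev) != 0) {
        return fmt_curr;
    }
    return fmt_curr.substr(fmt_prev.size());
}
```

Applied to the example above, `fmt_prev` ends after `I am an assistant<end_of_turn>`, so the returned suffix is exactly the part shown after the separator.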
This approach requires minimal effort to maintain the chat template infrastructure, while using the exact same logic for `main` and `server` (reminder: `server` also has the notion of a "prompt cache", which works the same way).

Having to re-format the whole chat history each time may seem inefficient at first glance, but it is needed because:

  • there are some edge cases, see: https://github.com/ggerganov/llama.cpp/issues/6391#issuecomment-2068131134
  • it is the same logic as `server` (which is designed to be stateless)

Even after re-formatting, only the diff between the two strings is fed into inference. Remaining tasks:

  • [x] Implement `chat_get_added_part` to get the diff with / without the last user message
  • [ ] `main` must keep track of the list of messages (see the sketch after this list)
  • [ ] Update the arguments for `main`: deprecate `-cml` (but do not remove it) and add a `--chat-template` argument
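
For the second item, a rough sketch of how `main` could keep track of the messages and, using the `chat_get_added_part` sketch above, evaluate only the newly added part each turn (names such as `chat_state`, `on_user_turn` and `on_assistant_turn` are illustrative, not the final API):

```cpp
#include <string>
#include <utility>
#include <vector>

#include "llama.h"

// owns the conversation strings; llama_chat_apply_template() only needs
// a C-style view of them
struct chat_state {
    std::vector<std::pair<std::string, std::string>> turns; // (role, content)

    void add(std::string role, std::string content) {
        turns.emplace_back(std::move(role), std::move(content));
    }

    // build the llama_chat_message array expected by the C API;
    // the pointers stay valid as long as `turns` is not modified
    std::vector<llama_chat_message> to_c() const {
        std::vector<llama_chat_message> out;
        out.reserve(turns.size());
        for (const auto & t : turns) {
            out.push_back({ t.first.c_str(), t.second.c_str() });
        }
        return out;
    }
};

// each user turn: record the message, compute the diff and evaluate only that part
static void on_user_turn(const llama_model * model, chat_state & state,
                         const std::string & user_text) {
    state.add("user", user_text);
    const std::string added = chat_get_added_part(model, state.to_c());
    // ... tokenize `added` and feed it via llama_decode(), as main already does ...
}

// after generation: record the assistant reply so the next turn is formatted correctly
static void on_assistant_turn(chat_state & state, const std::string & reply) {
    state.add("assistant", reply);
}
```

Keeping the owned strings separate from the `llama_chat_message` views avoids lifetime issues with the C API, while still letting `main` re-format the full history every turn, matching the server behaviour.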

ngxson · Apr 21 '24 16:04