llama-cpp-python
Fix v1/chat/completions Gibberish API Responses
The chat completion API in the FastAPI server wasn't completing chats consistently. The results were often gibberish (like \nA\n/imagine prompt: User is asking about, or text that just referred back to the system message), so I went ahead and tweaked the prompt. It was also weirdly formatted, which probably confused the text generation even more.
Here it is before and after with the default example (running vicuna-13B unfiltered):
Before Prompt
### Instructions:Complete the following chat conversation between the user and the assistant. System messages should be strictly followed as additional instructions.
### Inputs:system None: You are a helpful assistant.
user None: What is the capital of France?
### Response:
assistant:
Results
{
"id": "chatcmpl-8d9ce5a6-841d-4568-acbe-67ea9640954a",
"object": "chat.completion",
"created": 1680854923,
"model": "../llama.cpp/models/vicuna/13B/ggml-vicuna-unfiltered-13b-4bit.bin",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "\nA\n/imagine prompt: User is asking about the capital of France, Assistant should provide a clear and concise answer, perhaps mentioning some interesting facts about the city or its history. The response should be friendly and helpful, using positive language and encouraging further questions. It should also include some basic information about Paris, such as its location in the north of France, its famous landmarks or cultural attractions, or its population and history.\n\n"
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 70,
"completion_tokens": 98,
"total_tokens": 168
}
}
After Prompt
### Instructions:
Complete the following chat conversation between the user and the assistant. System messages should be strictly followed as additional instructions.
system None: You are a helpful assistant.
user None: What is the capital of France?
### Response:
assistant:
Results
{
"id": "chatcmpl-35a2850c-e9cd-445b-ad63-046cb98cb107",
"object": "chat.completion",
"created": 1680854743,
"model": "../llama.cpp/models/vicuna/13B/ggml-vicuna-unfiltered-13b-4bit.bin",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": " The capital of France is Paris.\n"
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 61,
"completion_tokens": 12,
"total_tokens": 73
}
}
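The "After Prompt" layout above can be sketched as a small formatting function. This is only an illustration of the layout shown in this description, not the code changed in the PR: the name build_prompt and the optional "user" key are assumptions, and the literal "None" in the output simply mirrors the unset optional field visible in the prompts above.

```python
from typing import Dict, List, Optional

# Instruction text copied from the prompt shown above.
INSTRUCTIONS = (
    "Complete the following chat conversation between the user and the assistant. "
    "System messages should be strictly followed as additional instructions."
)


def build_prompt(messages: List[Dict[str, Optional[str]]]) -> str:
    """Hypothetical helper: lay out chat messages in the 'After Prompt' format."""
    # The header and the instructions go on separate lines.
    lines = ["### Instructions:", INSTRUCTIONS]
    for message in messages:
        # An unset optional "user" field renders as the literal "None",
        # which matches the prompts shown above.
        lines.append(f"{message['role']} {message.get('user')}: {message['content']}")
    # End with the response header and the assistant cue for the model to complete.
    lines.append("### Response:")
    lines.append("assistant:")
    return "\n".join(lines)


example_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]
prompt = build_prompt(example_messages)
```

Printing prompt should reproduce the six lines under "After Prompt". The only difference from the "Before" layout is that the instructions start on their own line and the "### Inputs:" header is gone.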
I also followed the general guidance around default parameters for chatting in https://www.reddit.com/r/LocalLLaMA/comments/11o6o3f/how_to_install_llama_8bit_and_4bit/ to help with results as well.
Also added a few macOS-specific .gitignore entries, which should help with contributing.
@keldenl any hints where one could find an unfiltered vicuna grazing? asking for a friend ...
hug.. some.. faces?
Thanks for the contribution! I'll try to address this in a more general way with https://github.com/abetlen/llama-cpp-python/issues/17 by allowing you to load multiple models and set defaults based on the specific model.
Also, I haven't tested out the vicuna model yet but it looks very promising, I've found using alpaca for chat is less than ideal.
Vicuña has given me some good results. I've tweaked chat-ui (a ChatGPT clone built on the OpenAI API) and been able to run the FastAPI server against it! The chat is pretty good, other than the slower generation due to the lack of chat mode :/
@keldenl awesome, yeah now that the mac install bugs are fixed improving chat speed is definitely next on my list
lmk if i can help in parallel in any way 😀
Related to this - currently the completion prompt returns gibberish if the system prompt "You are a helpful assistant." is not set. It would be great if this could be omitted, similar to the actual OpenAI API.
Update?
i think the issue is you now need to specify the chat_format correctly ... it won't guess anymore.
@earonesty correct, this is all handled correctly now by the chat format and chat handler APIs.
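For anyone landing here later, a minimal sketch of setting the chat format explicitly from Python. It assumes a recent llama-cpp-python release where Llama accepts a chat_format argument and exposes create_chat_completion. The helper names, the "vicuna" format name, and the model path argument are illustrative assumptions, so check which format names your installed version registers.

```python
from typing import Dict, List, Optional


def build_messages(user_text: str, system_text: Optional[str] = None) -> List[Dict[str, str]]:
    """Hypothetical helper: build an OpenAI-style message list; the system message is optional."""
    messages: List[Dict[str, str]] = []
    if system_text is not None:
        messages.append({"role": "system", "content": system_text})
    messages.append({"role": "user", "content": user_text})
    return messages


def ask(model_path: str, user_text: str, chat_format: str = "vicuna") -> str:
    """Load a model with an explicit chat format and return the assistant's reply."""
    # Imported lazily so build_messages stays usable without llama-cpp-python installed.
    from llama_cpp import Llama

    # chat_format is passed explicitly; "vicuna" is an assumed format name for this model.
    llm = Llama(model_path=model_path, chat_format=chat_format)
    result = llm.create_chat_completion(
        messages=build_messages(user_text, "You are a helpful assistant.")
    )
    return result["choices"][0]["message"]["content"]
```

Calling ask with the path to a local model file and a question should return the reply text. Passing the wrong chat_format for a model is a likely cause of the kind of gibberish reported earlier in this thread.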