
The results generated differ from those produced by running the same model with the llama.cpp library

HengruiZYP opened this issue on Feb 25, 2025 · 0 comments

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • [x] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • [x] I carefully followed the README.md.
  • [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • [x] I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

When running DeepSeek-R1 on an arm64 architecture with llama-cpp-python version 0.3.7, I expect the performance and results to be comparable to those achieved using the llama.cpp library.

Current Behavior

When running the DeepSeek-R1 model on an arm64 architecture and calling the create_chat_completion function from the llama-cpp-python library's Llama module, the output differs from the results produced by the ./llama-cli binary built from the llama.cpp library.

Details:

  • The output from create_chat_completion does not include the model's <think> reasoning tag.
  • The responses tend to stop prematurely, not providing complete answers as expected.
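
One way to observe both symptoms is to inspect the non-streaming response dict returned by create_chat_completion. The snippet below is only an illustrative sketch: the llm object and the message text are assumptions, not taken from my actual code.

    # Illustrative sketch: "llm" is assumed to be an already-constructed llama_cpp.Llama instance.
    result = llm.create_chat_completion(
        messages=[{"role": "user", "content": "What is 2 + 2?"}],
        max_tokens=512,
    )
    choice = result["choices"][0]
    content = choice["message"]["content"]
    # Check whether the reasoning tag appears anywhere in the reply.
    print("contains reasoning tag:", "<think>" in content)
    # finish_reason is "stop" when a stop token/string ended generation,
    # and "length" when max_tokens was reached.
    print("finish_reason:", choice["finish_reason"])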

Environment and Context

  • Physical (or virtual) hardware you are using:
    • CPU: Cortex-A73 (aarch64 architecture)
  • Operating System:
    • Linux 20.04
  • SDK versions:
    • Python: 3.8.10
    • Make: 4.2.1
    • G++: 9.4.0
  • llama-cpp-python package version:
    • 0.3.7

Steps to Reproduce

  1. Install llama-cpp-python with specific CMake arguments:

    CMAKE_ARGS="-DGGML_NATIVE=OFF -DGGML_CROSS_COMPILE=ON" pip install llama-cpp-python
    
  2. Invoke create_chat_completion to generate a response:

    def generate(self, prompt, top_k, temp):
        output = self.llm.create_chat_completion(
            prompt,
            top_k=top_k,
            temperature=temp,
            max_tokens=self.max_tokens,
            stop=["Q:", "\n"],
            stream=self.stream,
        )
        return output
    
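For completeness, here is a self-contained variant of the snippet above that can be run outside the class. The model path, context size, and sampling values are placeholders (not the exact values from my setup), and it uses a non-streaming call with an explicit messages list:

    # Self-contained sketch; the model path and parameters below are placeholders.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./DeepSeek-R1-model.gguf",  # placeholder path
        n_ctx=4096,
    )

    def generate(prompt, top_k=40, temp=0.6, max_tokens=1024):
        # create_chat_completion expects a list of {"role": ..., "content": ...} messages.
        output = llm.create_chat_completion(
            messages=[{"role": "user", "content": prompt}],
            top_k=top_k,
            temperature=temp,
            max_tokens=max_tokens,
            stop=["Q:", "\n"],  # stop strings end generation as soon as one is produced
            stream=False,
        )
        return output["choices"][0]["message"]["content"]

    print(generate("To solve the system of equations x+y=20 and 2x+4y=56, what are the values of x and y?"))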

Observed Output

The content following "You:" is my input question, and the text beginning with "Okay, ..." is the answer; notice that the answer stops prematurely.

You: To solve the system of equations x+y=20 and 2x+4y=56. what are the values of x and y?
Okay, so I have this system of equations to solve: x plus y equals 20, and 2x plus 4y equals 56. Hmm, let me see how to approach this. I remember that there are a couple of methods to solve systems like this, like substitution and elimination. Maybe I'll try substitution first since it's often straightforward.

INFO:root:Bot: Okay, so I have this system of equations to solve: x plus y equals 20, and 2x plus 4y equals 56. Hmm, let me see how to approach this. I remember that there are a couple of methods to solve systems like this, like substitution and elimination. Maybe I'll try substitution first since it's often straightforward.
INFO:root:llm generate time: 21.382812 s
