
The results generated differ from those produced by running the same model with the llama.cpp library

HengruiZYP opened this issue on Feb 25, 2025 · 0 comments

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • [x] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • [x] I carefully followed the README.md.
  • [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • [x] I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

When running DeepSeek-R1 on an arm64 architecture with llama-cpp-python version 0.3.7, I expect the performance and results to be comparable to those achieved using the llama.cpp library.

Current Behavior

When running the DeepSeek-R1 model on an arm64 architecture and calling the create_chat_completion function from the llama-cpp-python library's Llama module, the output differs from the results produced by the ./llama-cli binary built from the llama.cpp library.

Details:

  • The output from create_chat_completion does not include the model's <think> reasoning tag.
  • The responses tend to stop prematurely, not providing complete answers as expected.
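
One way to observe both symptoms is to inspect the non-streaming response dict returned by create_chat_completion. The snippet below is only an illustrative sketch: the llm object and the message text are assumptions, not taken from my actual code.

    # Illustrative sketch: "llm" is assumed to be an already-constructed llama_cpp.Llama instance.
    result = llm.create_chat_completion(
        messages=[{"role": "user", "content": "What is 2 + 2?"}],
        max_tokens=512,
    )
    choice = result["choices"][0]
    content = choice["message"]["content"]
    # Check whether the reasoning tag appears anywhere in the reply.
    print("contains reasoning tag:", "<think>" in content)
    # finish_reason is "stop" when a stop token/string ended generation,
    # and "length" when max_tokens was reached.
    print("finish_reason:", choice["finish_reason"])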

Environment and Context

  • Physical (or virtual) hardware you are using:
    • CPU: Cortex-A73 (aarch64 architecture)
  • Operating System:
    • Linux 20.04
  • SDK versions:
    • Python: 3.8.10
    • Make: 4.2.1
    • G++: 9.4.0
  • llama-cpp-python package version:
    • 0.3.7

Steps to Reproduce

  1. Install llama-cpp-python with specific CMake arguments:

    CMAKE_ARGS="-DGGML_NATIVE=OFF -DGGML_CROSS_COMPILE=ON" pip install llama-cpp-python
    
  2. Invoke create_chat_completion to generate a response:

    def generate(self, prompt, top_k, temp):
        output = self.llm.create_chat_completion(
            prompt,
            top_k=top_k,
            temperature=temp,
            max_tokens=self.max_tokens,
            stop=["Q:", "\n"],
            stream=self.stream,
        )
        return output
    
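For completeness, here is a self-contained variant of the snippet above that can be run outside the class. The model path, context size, and sampling values are placeholders (not the exact values from my setup), and it uses a non-streaming call with an explicit messages list:

    # Self-contained sketch; the model path and parameters below are placeholders.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./DeepSeek-R1-model.gguf",  # placeholder path
        n_ctx=4096,
    )

    def generate(prompt, top_k=40, temp=0.6, max_tokens=1024):
        # create_chat_completion expects a list of {"role": ..., "content": ...} messages.
        output = llm.create_chat_completion(
            messages=[{"role": "user", "content": prompt}],
            top_k=top_k,
            temperature=temp,
            max_tokens=max_tokens,
            stop=["Q:", "\n"],  # stop strings end generation as soon as one is produced
            stream=False,
        )
        return output["choices"][0]["message"]["content"]

    print(generate("To solve the system of equations x+y=20 and 2x+4y=56, what are the values of x and y?"))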

Observed Output

The content following "You:" is my input question, and the text beginning with "Okay, ..." is the answer; notice that the answer stops prematurely.

You: To solve the system of equations x+y=20 and 2x+4y=56. what are the values of x and y?
Okay, so I have this system of equations to solve: x plus y equals 20, and 2x plus 4y equals 56. Hmm, let me see how to approach this. I remember that there are a couple of methods to solve systems like this, like substitution and elimination. Maybe I'll try substitution first since it's often straightforward.

INFO:root:Bot: Okay, so I have this system of equations to solve: x plus y equals 20, and 2x plus 4y equals 56. Hmm, let me see how to approach this. I remember that there are a couple of methods to solve systems like this, like substitution and elimination. Maybe I'll try substitution first since it's often straightforward.
INFO:root:llm generate time: 21.382812 s
