llama-cpp-python
Not selecting the tokens with the highest probabilities with temperature 0
Prerequisites
Please answer the following questions for yourself before submitting an issue.
- [x] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
- [x] I carefully followed the README.md.
- [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- [x] I reviewed the Discussions, and have a new bug or useful enhancement to share.
Expected Behavior
When running Mistral LLM with a temperature of 0, I expected the model to choose tokens with the highest probabilities exclusively.
Current Behavior
I observed that for some tokens (apparently randomly), it selected the ones with the second-highest probabilities, leading to different outputs.
Environment and Context
I am using Google Colab, T4 GPU with 15 GB VRAM.
$ python3 --version
Python 3.10.12
$ make --version
GNU Make 4.3
$ g++ --version
g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
$ pip list | egrep "pandas|numpy"
geopandas 0.13.2
numpy 1.25.2
pandas 1.5.3
pandas-datareader 0.10.0
pandas-gbq 0.19.2
pandas-stubs 1.5.3.230304
sklearn-pandas 2.2.0
Steps to Reproduce
- Installing llama-cpp-python
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python==0.2.57
- Initializing the model
from llama_cpp import Llama

llm = Llama(
    model_path="mistral-7b-instruct-v0.2.Q8_0.gguf",
    n_ctx=10000,
    n_gpu_layers=-1,
    verbose=True,
    logits_all=True  # True to return logprobs
)
- Running the model and getting the outputs
prompt = """[INST] (((5 + 4)*2)/9)+2? [/INST] """
llm.reset() # to clear the cache
output = llm.create_completion(
    prompt,
    max_tokens=200,
    echo=False,
    temperature=0,
    logprobs=5  # return the top 5 tokens
)
results = output['choices'][0]['text']
print('Results from the model:\n')
print(results)
log_probs = output['choices'][0]['logprobs']['top_logprobs']
print('Results by selecting tokens with the highest probabilities:\n')
for el in log_probs:
    chosen = max(el, key=lambda k: el[k])
    print(chosen, end='')
The outputs of the two print statements above are shown below; they are not exactly the same, because different tokens were selected in a few places:
Results from the model:
To solve the expression step by step, follow these instructions:
1. First, perform the multiplication inside the parentheses: (5 + 4) * 2 = 9 * 2 = 18.
2. Next, do the division: 18 / 9 = 2.
3. Add 2 to the result: 2 + 2 = 4.
So, the solution to ((5 + 4)*2)/9+2 is 4.
Results by selecting tokens with the highest probabilities:
To solve the expression ((( by step, follow the instructions:
1. First, perform the multiplication inside the parentheses: (5 + 4) * 2 = 9 * 2 = 18.
2. Next, do the division: 18 / 9 = 2.
3. Add 2 to the result: 2 + 2 = 4.
So, the solution to the5 + 4)*2)/9 +2 is 4.
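To pinpoint exactly where the sampled output and the highest-probability candidates diverge, the two streams can be compared position by position. A minimal sketch, assuming the completion's logprobs dict also exposes the OpenAI-style 'tokens' list alongside 'top_logprobs' (the 'tokens' field is not shown above and is an assumption here):

```python
logprobs_info = output['choices'][0]['logprobs']
emitted = logprobs_info['tokens']             # tokens the sampler actually produced (assumed field)
top_logprobs = logprobs_info['top_logprobs']  # top-5 candidates reported for each position

for i, (tok, candidates) in enumerate(zip(emitted, top_logprobs)):
    best = max(candidates, key=candidates.get)  # candidate with the highest logprob
    if tok != best:
        print(f"position {i}: sampled {tok!r}, but the argmax candidate is {best!r}")
```

This should print only the handful of positions (such as ' step' vs ' (((') where the two transcripts differ.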
- Converting log_probs to probabilities

import numpy as np

# convert log probabilities to probabilities
for d in log_probs:
    for key, value in d.items():
        d[key] = np.exp(value)
The probabilities show that, for example, the model selects the token ' step' with probability 0.22 instead of the token ' (((' with probability 0.54:
{' (((': 0.5391272,
' step': 0.22150521,
' ((': 0.15229335,
',': 0.06753055,
'(((': 0.010963313}
As another example, it selects '+' with probability 0.0002 instead of ' +' (with a leading whitespace) with probability 0.99 in this expression: ((5 + 4)*2)/9+2 is 4
{' +': 0.9997909,
'+': 0.00020875783,
' plus': 2.6573096e-07,
' ±': 2.159923e-08,
' +=': 1.6546656e-08},
I got very similar results with mistral-7b-instruct-v0.2.Q5_K_M.gguf.

As an alternative solution, you can get correct, deterministic output by setting top_k=1 instead of using temperature.
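For reference, the suggestion presumably amounts to a call like the one below, reusing the prompt and parameters from the reproduction above and only adding top_k:

```python
output = llm.create_completion(
    prompt,
    max_tokens=200,
    echo=False,
    temperature=0,
    top_k=1,     # keep only the single most probable token at each step
    logprobs=5
)
```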
I tried top_k=1 but it didn't solve the problem; the results are the same.
Try setting temperature<0. I'll investigate this further, but it looks like this stems from
https://github.com/abetlen/llama-cpp-python/blob/909ef66951fb9f3f5c724841b0c98a598a2e0802/llama_cpp/_internals.py#L751-L755
which matches llama.cpp but is likely not what we want for logprobs.
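Presumably that means passing a negative value directly; the -1.0 below is an arbitrary illustration, not a value taken from the library's documentation:

```python
output = llm.create_completion(
    prompt,
    max_tokens=200,
    temperature=-1.0,  # any negative value, per the suggestion above
    logprobs=5
)
```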
I tried temperature<0 but got the following ValueError:
ValueError                                Traceback (most recent call last)
1 frames
/usr/local/lib/python3.10/dist-packages/llama_cpp/llama.py in _create_completion(self, prompt, suffix, max_tokens, temperature, top_p, min_p, typical_p, logprobs, echo, stop, frequency_penalty, presence_penalty, repeat_penalty, top_k, stream, seed, tfs_z, mirostat_mode, mirostat_tau, mirostat_eta, model, stopping_criteria, logits_processor, grammar, logit_bias)
   1022             grammar=grammar,
   1023         ):
-> 1024             if token == self._token_eos:
   1025                 text = self.detokenize(completion_tokens, prev_tokens=prompt_tokens)
   1026                 finish_reason = "stop"

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
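For context, that message is the generic NumPy error raised whenever a whole array is used in a boolean context. A standalone reproduction of just the error, unrelated to llama-cpp-python itself:

```python
import numpy as np

token = np.array([1, 2, 3])
eos = 2

# `token == eos` is an element-wise boolean array; using it in `if` raises
# "The truth value of an array with more than one element is ambiguous."
if token == eos:
    pass
```

So in the traceback above, one of the two operands of == is evidently an array rather than a scalar token id.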
I stumbled upon this issue while debugging a separate issue in a different repository.
After updating several packages, including Guidance (which depends on llama-cpp-python), I noticed my Mistral Instruct responses had suddenly become gibberish.
After some detective work, I determined that the culprit was a change that happened between 0.2.57 and 0.2.58.
It may or may not be related, but try downgrading to 0.2.57 for now, and see if that fixes y'all's issues.
@FoxBuchele If you're a Guidance user, your issue is likely there rather than in llama-cpp-python. It might be fixed by installing Guidance directly from the current main branch on the repo.
In the current Guidance release on PyPI (0.1.13), logits for llama-cpp are accessed with:
# get the logits
logits = llama_cpp.llama_get_logits(self.model_obj.ctx)
logits = logits[(n_tokens - 1) * self._n_vocab : n_tokens * self._n_vocab]
logits = np.ctypeslib.as_array(logits, shape=(self._n_vocab,)).copy()
However, llama-cpp-python's llama_get_logits() return value changed in 0.2.58 to no longer require the second line of this logic. The current Guidance main branch has the following:
# get the logits
logits = llama_cpp.llama_get_logits(self.model_obj.ctx)
if llama_cpp.__version__ < "0.2.58":
    logits = logits[(n_tokens - 1) * self._n_vocab : n_tokens * self._n_vocab]
logits = np.ctypeslib.as_array(logits, shape=(self._n_vocab,)).copy()
This fix looks like it will be in a future release but at present Guidance and llama-cpp-python releases are incompatible. If you need to use llama-cpp-python>0.2.57 with Guidance now, making this change via installing from main should sort you out. Note that this definitely isn't the issue for elhambb because they are on 0.2.57.
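For completeness, the two workarounds mentioned above would look roughly like this; the Guidance repository URL is an assumption (adjust if yours differs), and the leading ! matches the notebook-style install step at the top of the issue:

```
# pin llama-cpp-python to the last release compatible with Guidance 0.1.13
!pip install llama-cpp-python==0.2.57

# or install Guidance from its main branch, which contains the version check shown above
!pip install git+https://github.com/guidance-ai/guidance.git
```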
> When running Mistral LLM with a temperature of 0, I expected the model to choose tokens with the highest probabilities exclusively.
I believe that minimizing not just temperature but also top_p is required to achieve this (do not set them to exactly zero, though; google why this would not work).
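As a sketch of that suggestion, with arbitrary small (but non-zero) values rather than anything recommended by the library:

```python
output = llm.create_completion(
    prompt,
    max_tokens=200,
    temperature=0.01,  # small but non-zero
    top_p=0.01,        # small but non-zero
    logprobs=5
)
```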