llama-cpp-python
Not selecting the tokens with the highest probabilities with temperature 0
Prerequisites
Please answer the following questions for yourself before submitting an issue.
- [x] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
- [x] I carefully followed the README.md.
- [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- [x] I reviewed the Discussions, and have a new bug or useful enhancement to share.
Expected Behavior
When running Mistral LLM with a temperature of 0, I expected the model to choose tokens with the highest probabilities exclusively.
Current Behavior
I observed that for some tokens (apparently randomly), it selected the ones with the second-highest probabilities, leading to different outputs.
Environment and Context
I am using Google Colab, T4 GPU with 15 GB VRAM.
$ python3 --version
Python 3.10.12
$ make --version
GNU Make 4.3
$ g++ --version
g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
$ pip list | egrep "pandas|numpy"
geopandas 0.13.2
numpy 1.25.2
pandas 1.5.3
pandas-datareader 0.10.0
pandas-gbq 0.19.2
pandas-stubs 1.5.3.230304
sklearn-pandas 2.2.0
Steps to Reproduce
- Installing llama-cpp-python
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python==0.2.57
- Initializing the model
from llama_cpp import Llama

llm = Llama(
    model_path="mistral-7b-instruct-v0.2.Q8_0.gguf",
    n_ctx=10000,
    n_gpu_layers=-1,
    verbose=True,
    logits_all=True  # True to return logprobs
)
- Running the model and getting the outputs
prompt = """[INST] (((5 + 4)*2)/9)+2? [/INST] """
llm.reset() # to clear the cache
output = llm.create_completion(
    prompt,
    max_tokens=200,
    echo=False,
    temperature=0,
    logprobs=5  # return the top 5 tokens
)
results = output['choices'][0]['text']
print('Results from the model:\n')
print(results)
log_probs = output['choices'][0]['logprobs']['top_logprobs']
print('Results by selecting tokens with the highest probabilities:\n')
for el in log_probs:
    chosen = max(el, key=lambda k: el[k])
    print(chosen, end='')
The outputs of the two print statements above are shown below; they are not exactly the same, because different tokens were selected in a few places:
Results from the model:
To solve the expression step by step, follow these instructions:
1. First, perform the multiplication inside the parentheses: (5 + 4) * 2 = 9 * 2 = 18.
2. Next, do the division: 18 / 9 = 2.
3. Add 2 to the result: 2 + 2 = 4.
So, the solution to ((5 + 4)*2)/9+2 is 4.
Results by selecting tokens with the highest probabilities:
To solve the expression ((( by step, follow the instructions:
1. First, perform the multiplication inside the parentheses: (5 + 4) * 2 = 9 * 2 = 18.
2. Next, do the division: 18 / 9 = 2.
3. Add 2 to the result: 2 + 2 = 4.
So, the solution to the5 + 4)*2)/9 +2 is 4.
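To pinpoint exactly where the sampled output and the highest-probability candidates diverge, the two streams can be compared position by position. A minimal sketch, assuming the completion's logprobs dict also exposes the OpenAI-style 'tokens' list alongside 'top_logprobs' (the 'tokens' field is not shown above and is an assumption here):

```python
logprobs_info = output['choices'][0]['logprobs']
emitted = logprobs_info['tokens']             # tokens the sampler actually produced (assumed field)
top_logprobs = logprobs_info['top_logprobs']  # top-5 candidates reported for each position

for i, (tok, candidates) in enumerate(zip(emitted, top_logprobs)):
    best = max(candidates, key=candidates.get)  # candidate with the highest logprob
    if tok != best:
        print(f"position {i}: sampled {tok!r}, but the argmax candidate is {best!r}")
```

This should print only the handful of positions (such as ' step' vs ' (((') where the two transcripts differ.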
- Converting log_probs to probabilities

import numpy as np

# convert log probabilities to probabilities
for d in log_probs:
    for key, value in d.items():
        d[key] = np.exp(value)
The probabilities show that, for example, the model selects the token ' step' with probability 0.22 instead of the token ' (((' with probability 0.54:
{' (((': 0.5391272,
' step': 0.22150521,
' ((': 0.15229335,
',': 0.06753055,
'(((': 0.010963313}
As another example, it selects '+' with probability 0.0002 instead of ' +' (with a leading whitespace) with probability 0.99 in this expression: ((5 + 4)*2)/9+2 is 4
{' +': 0.9997909,
'+': 0.00020875783,
' plus': 2.6573096e-07,
' ±': 2.159923e-08,
' +=': 1.6546656e-08},
I got very similar results with mistral-7b-instruct-v0.2.Q5_K_M.gguf.

As an alternative solution, you can get correct, deterministic output by setting top_k=1 instead of using temperature.
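For reference, the suggestion presumably amounts to a call like the one below, reusing the prompt and parameters from the reproduction above and only adding top_k:

```python
output = llm.create_completion(
    prompt,
    max_tokens=200,
    echo=False,
    temperature=0,
    top_k=1,     # keep only the single most probable token at each step
    logprobs=5
)
```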
I tried top_k=1 but it didn't solve the problem; the results are the same.
Try setting temperature<0. I'll investigate this further, but it looks like this stems from
https://github.com/abetlen/llama-cpp-python/blob/909ef66951fb9f3f5c724841b0c98a598a2e0802/llama_cpp/_internals.py#L751-L755
which matches llama.cpp but is likely not what we want for logprobs.
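Presumably that means passing a negative value directly; the -1.0 below is an arbitrary illustration, not a value taken from the library's documentation:

```python
output = llm.create_completion(
    prompt,
    max_tokens=200,
    temperature=-1.0,  # any negative value, per the suggestion above
    logprobs=5
)
```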
I tried temperature<0 but got the following ValueError:
ValueError                                Traceback (most recent call last)
1 frames
/usr/local/lib/python3.10/dist-packages/llama_cpp/llama.py in _create_completion(self, prompt, suffix, max_tokens, temperature, top_p, min_p, typical_p, logprobs, echo, stop, frequency_penalty, presence_penalty, repeat_penalty, top_k, stream, seed, tfs_z, mirostat_mode, mirostat_tau, mirostat_eta, model, stopping_criteria, logits_processor, grammar, logit_bias)
   1022             grammar=grammar,
   1023         ):
-> 1024             if token == self._token_eos:
   1025                 text = self.detokenize(completion_tokens, prev_tokens=prompt_tokens)
   1026                 finish_reason = "stop"

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
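For context, that message is the generic NumPy error raised whenever a whole array is used in a boolean context. A standalone reproduction of just the error, unrelated to llama-cpp-python itself:

```python
import numpy as np

token = np.array([1, 2, 3])
eos = 2

# `token == eos` is an element-wise boolean array; using it in `if` raises
# "The truth value of an array with more than one element is ambiguous."
if token == eos:
    pass
```

So in the traceback above, one of the two operands of == is evidently an array rather than a scalar token id.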
I stumbled upon this issue while debugging a separate issue in a different repository.
After updating several packages, including Guidance (which depends on llama-cpp-python), I noticed my Mistral Instruct responses had suddenly become gibberish.
After some detective work, I determined that the culprit was a change that happened between 0.2.57 and 0.2.58.
It may or may not be related, but try downgrading to 0.2.57 for now, and see if that fixes y'all's issues.
@FoxBuchele If you're a Guidance user, your issue is likely there rather than in llama-cpp-python. It might be fixed by installing Guidance directly from the current main branch on the repo.
In the current Guidance release on PyPI (0.1.13), logits for llama-cpp are accessed with:
# get the logits
logits = llama_cpp.llama_get_logits(self.model_obj.ctx)
logits = logits[(n_tokens - 1) * self._n_vocab : n_tokens * self._n_vocab]
logits = np.ctypeslib.as_array(logits, shape=(self._n_vocab,)).copy()
However, llama-cpp-python's llama_get_logits() return value changed in 0.2.58 to no longer require the second line of this logic. The current Guidance main branch has the following:
# get the logits
logits = llama_cpp.llama_get_logits(self.model_obj.ctx)
if llama_cpp.__version__ < "0.2.58":
    logits = logits[(n_tokens - 1) * self._n_vocab : n_tokens * self._n_vocab]
logits = np.ctypeslib.as_array(logits, shape=(self._n_vocab,)).copy()
This fix looks like it will be in a future release but at present Guidance and llama-cpp-python releases are incompatible. If you need to use llama-cpp-python>0.2.57 with Guidance now, making this change via installing from main should sort you out. Note that this definitely isn't the issue for elhambb because they are on 0.2.57.
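For completeness, the two workarounds mentioned above would look roughly like this; the Guidance repository URL is an assumption (adjust if yours differs), and the leading ! matches the notebook-style install step at the top of the issue:

```
# pin llama-cpp-python to the last release compatible with Guidance 0.1.13
!pip install llama-cpp-python==0.2.57

# or install Guidance from its main branch, which contains the version check shown above
!pip install git+https://github.com/guidance-ai/guidance.git
```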
> When running Mistral LLM with a temperature of 0, I expected the model to choose tokens with the highest probabilities exclusively.
I believe that minimizing not just temperature but also top_p is required to achieve this (do not set them to exactly zero, though; google why this would not work).
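As a sketch of that suggestion, with arbitrary small (but non-zero) values rather than anything recommended by the library:

```python
output = llm.create_completion(
    prompt,
    max_tokens=200,
    temperature=0.01,  # small but non-zero
    top_p=0.01,        # small but non-zero
    logprobs=5
)
```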