llama-cpp-python
Crashing with "IndexError: index 200 is out of bounds for axis 0 with size 200"
Installed with pip inside a conda environment. Version: '0.2.69'. The code is as follows:
llm = Llama(
    model_path="/data/codelama-2024-02/CodeLlama-7b-Python/ggml-model-f16.gguf",
    seed=1023,    # fixed seed for reproducibility
    n_ctx=200,    # context window size
    n_batch=200,
    verbose=True,
)
llm(" Your task is to write a Python script that loads from CSV file ",
    max_tokens=1024, echo=True)
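For reference, a quick check of the mismatch (this snippet is added for illustration and is not part of the original report; tokenize is the standard llama-cpp-python method): with n_ctx=200, the context has far fewer free slots than the 1024 tokens requested via max_tokens.

prompt = " Your task is to write a Python script that loads from CSV file "
prompt_tokens = llm.tokenize(prompt.encode("utf-8"))
# The context window has 200 slots in total; whatever is left after the prompt
# is much smaller than the requested max_tokens=1024.
print(len(prompt_tokens), "prompt tokens;", 200 - len(prompt_tokens), "slots left for output")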
Error message
File ~/anaconda3/envs/pytorch_py39_cu11.8/lib/python3.9/site-packages/llama_cpp/llama.py:1588, in Llama.__call__(self, prompt, suffix, max_tokens, temperature, top_p, min_p, typical_p, logprobs, echo, stop, frequency_penalty, presence_penalty, repeat_penalty, top_k, stream, seed, tfs_z, mirostat_mode, mirostat_tau, mirostat_eta, model, stopping_criteria, logits_processor, grammar, logit_bias)
1524 def __call__(
1525 self,
1526 prompt: str,
(...)
1550 logit_bias: Optional[Dict[str, float]] = None,
1551 ) -> Union[CreateCompletionResponse, Iterator[CreateCompletionStreamResponse]]:
1552 """Generate text from a prompt.
1553
1554 Args:
(...)
1586 Response object containing the generated text.
1587 """
-> 1588 return self.create_completion(
1589 prompt=prompt,
1590 suffix=suffix,
1591 max_tokens=max_tokens,
1592 temperature=temperature,
1593 top_p=top_p,
1594 min_p=min_p,
1595 typical_p=typical_p,
1596 logprobs=logprobs,
1597 echo=echo,
1598 stop=stop,
1599 frequency_penalty=frequency_penalty,
1600 presence_penalty=presence_penalty,
1601 repeat_penalty=repeat_penalty,
1602 top_k=top_k,
1603 stream=stream,
1604 seed=seed,
1605 tfs_z=tfs_z,
1606 mirostat_mode=mirostat_mode,
1607 mirostat_tau=mirostat_tau,
1608 mirostat_eta=mirostat_eta,
1609 model=model,
1610 stopping_criteria=stopping_criteria,
1611 logits_processor=logits_processor,
1612 grammar=grammar,
1613 logit_bias=logit_bias,
1614 )
File ~/anaconda3/envs/pytorch_py39_cu11.8/lib/python3.9/site-packages/llama_cpp/llama.py:1521, in Llama.create_completion(self, prompt, suffix, max_tokens, temperature, top_p, min_p, typical_p, logprobs, echo, stop, frequency_penalty, presence_penalty, repeat_penalty, top_k, stream, seed, tfs_z, mirostat_mode, mirostat_tau, mirostat_eta, model, stopping_criteria, logits_processor, grammar, logit_bias)
1519 chunks: Iterator[CreateCompletionStreamResponse] = completion_or_chunks
1520 return chunks
-> 1521 completion: Completion = next(completion_or_chunks) # type: ignore
1522 return completion
File ~/anaconda3/envs/pytorch_py39_cu11.8/lib/python3.9/site-packages/llama_cpp/llama.py:1046, in Llama._create_completion(self, prompt, suffix, max_tokens, temperature, top_p, min_p, typical_p, logprobs, echo, stop, frequency_penalty, presence_penalty, repeat_penalty, top_k, stream, seed, tfs_z, mirostat_mode, mirostat_tau, mirostat_eta, model, stopping_criteria, logits_processor, grammar, logit_bias)
1044 finish_reason = "length"
1045 multibyte_fix = 0
-> 1046 for token in self.generate(
1047 prompt_tokens,
1048 top_k=top_k,
1049 top_p=top_p,
1050 min_p=min_p,
1051 typical_p=typical_p,
1052 temp=temperature,
1053 tfs_z=tfs_z,
1054 mirostat_mode=mirostat_mode,
1055 mirostat_tau=mirostat_tau,
1056 mirostat_eta=mirostat_eta,
1057 frequency_penalty=frequency_penalty,
1058 presence_penalty=presence_penalty,
1059 repeat_penalty=repeat_penalty,
1060 stopping_criteria=stopping_criteria,
1061 logits_processor=logits_processor,
1062 grammar=grammar,
1063 ):
1064 assert self._model.model is not None
1065 if llama_cpp.llama_token_is_eog(self._model.model, token):
File ~/anaconda3/envs/pytorch_py39_cu11.8/lib/python3.9/site-packages/llama_cpp/llama.py:709, in Llama.generate(self, tokens, top_k, top_p, min_p, typical_p, temp, repeat_penalty, reset, frequency_penalty, presence_penalty, tfs_z, mirostat_mode, mirostat_tau, mirostat_eta, penalize_nl, logits_processor, stopping_criteria, grammar)
707 # Eval and sample
708 while True:
--> 709 self.eval(tokens)
710 while sample_idx < self.n_tokens:
711 token = self.sample(
712 top_k=top_k,
713 top_p=top_p,
(...)
727 idx=sample_idx,
728 )
File ~/anaconda3/envs/pytorch_py39_cu11.8/lib/python3.9/site-packages/llama_cpp/llama.py:560, in Llama.eval(self, tokens)
558 cols = self._n_vocab
559 logits = self._ctx.get_logits()[: rows * cols]
--> 560 self.scores[n_past + n_tokens - 1, :].reshape(-1)[: :] = logits
561 # Update n_tokens
562 self.n_tokens += n_tokens
IndexError: index 200 is out of bounds for axis 0 with size 200
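For context, a minimal sketch of what the traceback suggests is happening (the shapes below are illustrative, not llama-cpp-python's actual allocation code): the scores buffer appears to be preallocated with n_ctx rows, so once n_past + n_tokens reaches n_ctx the write targets a row that does not exist.

import numpy as np

# Illustration only: one row of logits per context slot, n_ctx = 200 rows.
n_ctx, n_vocab = 200, 32000   # n_vocab is a placeholder value
scores = np.zeros((n_ctx, n_vocab), dtype=np.single)

n_past, n_tokens = 200, 1     # generation has already filled the whole context
try:
    scores[n_past + n_tokens - 1, :] = 0.0   # row index 200 into 200 rows
except IndexError as e:
    print(e)                  # index 200 is out of bounds for axis 0 with size 200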
I am having the same issue
Problem with max_tokens less than n_ctx. I think we need to add an assert to ensure the context is bigger than the generated text size.
max_tokens is 1024 while n_ctx is 200 in the provided example, though. Do you mean that n_ctx should be greater than the prompt tokens plus the output tokens?
Yes, greater, not less. My mistake.
Did you try increasing the max output tokens (in this test, try setting it to e.g. 20k or so, just to be sure)? Does this solve the issue? As long as the input does not exceed the context (which should error out), I don't think the context is involved.
I think this bug happens when the input is smaller than n_ctx, but the input + output is greater than n_ctx.
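If that is the cause, one possible workaround until the library adds a guard is to cap max_tokens so the prompt plus the output stays inside n_ctx. A sketch using the llm object from the original report (the 8-token margin is arbitrary):

prompt = " Your task is to write a Python script that loads from CSV file "
n_prompt = len(llm.tokenize(prompt.encode("utf-8")))
max_new = llm.n_ctx() - n_prompt - 8   # leave a small safety margin below the context size
output = llm(prompt, max_tokens=max_new, echo=True)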
I never hit this error until upgrading llama_cpp_python.
File "/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/langchain_community/llms/llamacpp.py", line 341, in _stream
for part in result:
File "/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/llama_cpp/llama.py", line 1208, in _create_completion
for token in self.generate(
File "/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/llama_cpp/llama.py", line 800, in generate
self.eval(tokens)
File "/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/llama_cpp/llama.py", line 650, in eval
self.scores[n_past + n_tokens - 1, :].reshape(-1)[::] = logits
IndexError: index 3000 is out of bounds for axis 0 with size 3000
Happens even though llama_new_context_with_model: n_ctx = 3008
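Since this traceback goes through the langchain_community LlamaCpp wrapper, the same mitigation can be applied when constructing the wrapper. A sketch with illustrative values, not taken from the report:

from langchain_community.llms import LlamaCpp

# Sketch only: keep n_ctx comfortably larger than the longest prompt plus
# max_tokens so generation never runs past the preallocated context.
llm = LlamaCpp(
    model_path="/path/to/model.gguf",   # placeholder path
    n_ctx=4096,
    max_tokens=512,
)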
Did you find a solution for this issue? I'm having the same problem since upgrading llama_cpp_python.
Unfortunately not! I am using the C++ binding instead, and the Hermes models, which don't seem to work any better.