Eval bug: assertion failure "GGML_ASSERT(n_outputs_enc > 0 && "call llama_encode() first") failed" when running inference with a GGUF-quantized model
Name and Version
The latest version of llama.cpp
Operating systems
Windows
GGML backends
CPU
Hardware
Intel Core i5 (10th Gen), 16 GB RAM
Models
Flan T5 Large
Problem description & steps to reproduce
I have a fine-tuned Flan T5 model stored locally, which I quantized and converted to GGUF format with llama.cpp using the command below:
!python {path to convert_hf_to_gguf.py} {path to hf_model} --outfile {name_of_outputfile.gguf} --outtype {quantization type}
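For illustration, a hypothetical concrete invocation (the local paths are placeholders, and the q8_0 output type is an assumption based on the t5_8bit.gguf filename used below):

python convert_hf_to_gguf.py ./flan-t5-large-finetuned --outfile t5_8bit.gguf --outtype q8_0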
I then loaded the GGUF file using llama-cpp-python's Llama class:
from llama_cpp import Llama

gguf_model_path = "t5_8bit.gguf"
model = Llama(model_path=gguf_model_path)
When I try to run inference in a Jupyter Notebook, the kernel dies. Running the same code in Command Prompt instead produces the assertion failure "GGML_ASSERT(n_outputs_enc > 0 && "call llama_encode() first") failed".
I used the code below for inference on CPU; the failure is triggered at model.eval():
Code:
prompt = "Extract Tags and Relevant Text: Please Annotate that the market rates has fallen drastically"
tokens = model.tokenize(prompt.encode())
output_tokens = model.eval(tokens)  # assertion failure is triggered here
output = model.detokenize(tokens)
print(output)
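For reference, Flan T5 is an encoder-decoder model, and the assertion message itself says that llama_encode() must be called before decoding; model.eval() appears to drive only the decode path. The following is a minimal sketch of the same inference through llama-cpp-python's high-level completion API instead of the manual tokenize/eval/detokenize loop, assuming the installed build supports encoder-decoder GGUF models at all:

from llama_cpp import Llama

model = Llama(model_path="t5_8bit.gguf")
# The callable interface runs the whole generation loop internally,
# rather than exposing a raw eval() step.
result = model(
    "Extract Tags and Relevant Text: Please Annotate that the market rates has fallen drastically",
    max_tokens=64,
)
print(result["choices"][0]["text"])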
Why is this issue occurring, and what is the solution? I am trying to run quantized models locally for inference.
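As a sanity check that the GGUF file itself is valid, the model could also be tried with llama.cpp's own CLI, which has T5-family support (the binary name and path are assumptions about the local build):

llama-cli -m t5_8bit.gguf -p "Extract Tags and Relevant Text: Please Annotate that the market rates has fallen drastically"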
First Bad Commit
No response
Relevant log output
Running the script above in Command Prompt produces:
GGML_ASSERT(n_outputs_enc > 0 && "call llama_encode() first") failed