Eval bug: assertion failure "GGML_ASSERT(n_outputs_enc > 0 && "call llama_encode() first") failed" when running inference with a GGUF-quantized model
Name and Version
The latest version of llama.cpp
Operating systems
Windows
GGML backends
CPU
Hardware
Intel Core i5 (10th Gen), 16 GB RAM
Models
Flan T5 Large
Problem description & steps to reproduce
I have a fine-tuned Flan T5 model stored locally, which I quantized and converted to GGUF format with llama.cpp using the command below:
!python {path to convert_hf_to_gguf.py} {path to hf_model} --outfile {name_of_outputfile.gguf} --outtype {quantization type}
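For illustration, a hypothetical concrete invocation (the local paths are placeholders, and the q8_0 output type is an assumption based on the t5_8bit.gguf filename used below):

python convert_hf_to_gguf.py ./flan-t5-large-finetuned --outfile t5_8bit.gguf --outtype q8_0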
I then loaded the GGUF file using llama-cpp-python's Llama class:
from llama_cpp import Llama

gguf_model_path = "t5_8bit.gguf"
model = Llama(model_path=gguf_model_path)
When I try to run inference in a Jupyter Notebook, the kernel dies. Running the same code in Command Prompt instead produces the assertion failure "GGML_ASSERT(n_outputs_enc > 0 && "call llama_encode() first") failed".
I used the code below for inference on CPU; the failure is triggered at model.eval():
Code:
prompt = "Extract Tags and Relevant Text: Please Annotate that the market rates has fallen drastically"
tokens = model.tokenize(prompt.encode())
output_tokens = model.eval(tokens)  # assertion failure is triggered here
output = model.detokenize(tokens)
print(output)
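For reference, Flan T5 is an encoder-decoder model, and the assertion message itself says that llama_encode() must be called before decoding; model.eval() appears to drive only the decode path. The following is a minimal sketch of the same inference through llama-cpp-python's high-level completion API instead of the manual tokenize/eval/detokenize loop, assuming the installed build supports encoder-decoder GGUF models at all:

from llama_cpp import Llama

model = Llama(model_path="t5_8bit.gguf")
# The callable interface runs the whole generation loop internally,
# rather than exposing a raw eval() step.
result = model(
    "Extract Tags and Relevant Text: Please Annotate that the market rates has fallen drastically",
    max_tokens=64,
)
print(result["choices"][0]["text"])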
Why is this issue occurring, and what is the solution? I am trying to run quantized models locally for inference.
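As a sanity check that the GGUF file itself is valid, the model could also be tried with llama.cpp's own CLI, which has T5-family support (the binary name and path are assumptions about the local build):

llama-cli -m t5_8bit.gguf -p "Extract Tags and Relevant Text: Please Annotate that the market rates has fallen drastically"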
First Bad Commit
No response
Relevant log output
Running the script above in Command Prompt produces:
GGML_ASSERT(n_outputs_enc > 0 && "call llama_encode() first") failed