`generate.regex()` fails to generate regex-constrained text with madlad400-3b-mt T5 model
Describe the issue as clearly as possible:
I love the Outlines project, especially because it supports a wide range of model configurations while maintaining constraints.
However, I have been unable to use Outlines with my desired model: https://huggingface.co/jbochi/madlad400-3b-mt
While unconstrained text generation works just fine with this T5 Seq2Seq model, regex-constrained generation does not.
Steps/code to reproduce the bug:
from outlines import models, generate
from transformers import AutoModelForSeq2SeqLM
model = models.transformers("jbochi/madlad400-3b-mt", model_class=AutoModelForSeq2SeqLM, device="cpu")
print("Model loaded!")
# Initialize the generator
generator = generate.regex(model, regex_str=r"hello|goodbye")
# Generate a response
answer = generator("<2en> hola")
# Print the response
print(answer)
Expected result:
hello
Error message:
RuntimeError: Cannot convert token `X` (XXXXX) to bytes: X
Outlines/Python version information:
I am on the latest version of Outlines, installed directly from the repo's GitHub URL.
I also tested this code with several earlier versions of Outlines, and none resolved the issue.
I am on macOS.
Context for the issue:
The main differentiator between Outlines and other popular libraries such as Microsoft's Guidance is its versatility, for example its support for T5 models. However, this model is currently unusable with Outlines.
Thanks for reporting this issue. I ran your reproduction script.
Here are some samples of tokens (in byte format) which cause the Error in your model:
\xef\xbf\xbd\xe2\x80\x9e\xef\xbf\xbd\xc2\xb4\xe2\x80\x94\xef\xbf\xbd\xef\xbf\xbd\xc2\xbc
(Interestingly, each of these contains the replacement character, \xef\xbf\xbd, perhaps indicating whether the remaining bytes are a prefix or suffix of a complete character)
This seems to be a fourth method by which a tokenizer encodes "incomplete unicode" characters
https://github.com/dottxt-ai/outlines/blob/62b7601/outlines/fsm/regex.py#L910-L915
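As a small sketch of why U+FFFD shows up here: decoding a truncated UTF-8 sequence with errors="replace" substitutes the replacement character, and that character's own UTF-8 encoding is exactly the \xef\xbf\xbd prefix seen in the failing tokens. (This is a general Python/UTF-8 illustration, not Outlines code.)

```python
# Decode only the first 2 of the 3 bytes of "€" (U+20AC) to simulate a
# token holding an incomplete multi-byte character.
truncated = "€".encode("utf-8")[:2]                   # b'\xe2\x82'
decoded = truncated.decode("utf-8", errors="replace")
print(decoded)                                        # '\ufffd'

# The replacement character itself encodes to the byte prefix observed
# in the problematic madlad400 tokens.
print("\ufffd".encode("utf-8"))                       # b'\xef\xbf\xbd'
```

So a tokenizer that round-trips partial byte sequences through lossy UTF-8 decoding ends up with U+FFFD baked into its token strings, which the byte-conversion logic then cannot map back to the original bytes.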
This issue should be resolved by researching an effective method for standardizing tokenizers' representations of incomplete unicode. If there isn't one, then we can update re_llama_byte_token in regex.py.
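As a hedged sketch of one possible direction (the helper name below is hypothetical and not part of Outlines): tokens carrying the replacement character could be detected up front and then special-cased alongside the byte-level representations regex.py already handles.

```python
# Hypothetical helper: flag token strings that embed U+FFFD, which this
# tokenizer appears to use to mark undecodable partial byte sequences.
REPLACEMENT_CHAR = "\ufffd"

def contains_incomplete_unicode(token: str) -> bool:
    """Return True if the token contains the Unicode replacement character."""
    return REPLACEMENT_CHAR in token

# One of the failing byte samples from above trips the check:
sample = b"\xef\xbf\xbd\xc2\xb4".decode("utf-8")
print(contains_incomplete_unicode(sample))   # True
print(contains_incomplete_unicode("hello"))  # False
```

How such flagged tokens should then be mapped back to bytes is the open question; the check above only standardizes their detection.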