
`generate.regex()` fails to generate regex-constrained text with madlad400-3b-mt T5 model

Open urroxyz opened this issue 1 year ago • 1 comment

Describe the issue as clearly as possible:

I love the Outlines project, especially because it allows for a wide range of model configurations while still enforcing constraints.

However, I have been unable to use Outlines with my desired model: https://huggingface.co/jbochi/madlad400-3b-mt

While plain text generation works just fine with this T5 Seq2Seq model (see the sketch below), regex-constrained generation does not.
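
For reference, a minimal unconstrained call along these lines completes without error (a sketch only; it mirrors the reproduction script below, and the prompt and device are illustrative):

from outlines import models, generate
from transformers import AutoModelForSeq2SeqLM

model = models.transformers("jbochi/madlad400-3b-mt", model_class=AutoModelForSeq2SeqLM, device="cpu")

# Plain, unconstrained generation succeeds with the same model
generator = generate.text(model)
print(generator("<2en> hola"))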

Steps/code to reproduce the bug:

from outlines import models, generate
from transformers import AutoModelForSeq2SeqLM
 
model = models.transformers("jbochi/madlad400-3b-mt", model_class=AutoModelForSeq2SeqLM, device="cpu")
 
print("Model loaded!")
 
# Initialize the generator
generator = generate.regex(model, regex_str=r"hello|goodbye")
 
# Generate a response
answer = generator("<2en> hola")
 
# Print the response
print(answer)

Expected result:

hello

Error message:

RuntimeError: Cannot convert token `X` (XXXXX) to bytes: X

Outlines/Python version information:

I am on the latest version of Outlines; I installed it from the repo's GitHub URL.

However, I have also tested this code with various earlier versions of Outlines, and none of them helped.

I am on macOS.

Context for the issue:

A key advantage of Outlines over other popular libraries, such as Microsoft's Guidance, is its versatility, for example its support for T5 models. However, this particular model is currently unusable for constrained generation.

urroxyz · Sep 06 '24 03:09

I ran your reproduction script; thanks for informing us about this issue.

Here are some samples of tokens (in byte format) that trigger the error with your model:

  • \xef\xbf\xbd\xe2\x80\x9e
  • \xef\xbf\xbd\xc2\xb4
  • \xe2\x80\x94\xef\xbf\xbd
  • \xef\xbf\xbd\xc2\xbc

(Interestingly, each of these contains the replacement character, \xef\xbf\xbd; its position perhaps indicates whether the remaining bytes are a prefix or a suffix of a complete character.)
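
A quick way to enumerate such tokens, as a sketch (this assumes only the standard transformers AutoTokenizer API; the only criterion used is the presence of U+FFFD in the piece):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("jbochi/madlad400-3b-mt")

# Collect vocabulary pieces whose surface form contains U+FFFD, the
# replacement character seen in the failing byte sequences above.
suspect = []
for token_id in range(tokenizer.vocab_size):
    piece = tokenizer.convert_ids_to_tokens(token_id)
    if "\ufffd" in piece:
        suspect.append((token_id, piece.encode("utf-8")))

print(f"{len(suspect)} vocabulary pieces contain U+FFFD")
for token_id, raw in suspect[:10]:
    print(token_id, raw)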

This seems to be a fourth method by which a tokenizer encodes "incomplete unicode" characters:

https://github.com/dottxt-ai/outlines/blob/62b7601/outlines/fsm/regex.py#L910-L915

This issue should be resolved by researching an effective method for standardizing tokenizers' representations of incomplete unicode. If there isn't one, we can update re_llama_byte_token in regex.py.
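
As a rough illustration only (the helper names below are hypothetical, not Outlines API, and the byte-token pattern is an example rather than the exact re_llama_byte_token definition), the handling could route replacement-character pieces through the same byte-level branch as explicit byte tokens:

import re

# Example pattern for explicit SentencePiece byte tokens such as "<0x20>";
# see the linked regex.py for the actual re_llama_byte_token definition.
re_byte_token = re.compile(r"^<0x[0-9A-F]{2}>$")

def looks_like_incomplete_unicode(piece: str) -> bool:
    """Heuristic: a piece containing U+FFFD likely stands for partial UTF-8 bytes."""
    return "\ufffd" in piece

def is_byte_level_piece(piece: str) -> bool:
    # Treat both explicit byte tokens and replacement-character pieces as
    # byte-level, so they are not round-tripped as complete unicode text.
    return bool(re_byte_token.match(piece)) or looks_like_incomplete_unicode(piece)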

lapp0 · Sep 15 '24 02:09