
`select` produces different results than `gen`, even though the maximum likelihood answer should be the same (tokenization/token healing issue?)

Open wjn0 opened this issue 9 months ago • 18 comments

The bug

I have a minimal reproducible example where I would expect `select` and `gen` to produce similar results, but they don't. My experimentation suggests it may be a tokenization or token healing issue, but I'm not sure. If the behaviour is expected, some documentation explaining why would be useful.

To Reproduce

from transformers import AutoModelForCausalLM, AutoTokenizer

import guidance


print("guidance version: ", guidance.__version__)


model_name = "unsloth/llama-3-70b-Instruct-bnb-4bit"
device = "cuda"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
lm = guidance.models.Transformers(model=model, tokenizer=tokenizer)

messages = [
    {"role": "system",
     "content": "You are a book information generator. Generate a JSON structure representing a book with the following properties: title, author, and publication date. Generate the author property first."},
    {"role": "user",
     "content": "The book is called 'The Great Gatsby', written by F. Scott Fitzgerald, and was published in 1925."},
]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

lm += prompt
lm += "{\n  "
# To reproduce the select output, comment out the gen line below and uncomment the select line.
# lm += guidance.select(["\"author\"", "\"title\""]) + guidance.gen(max_tokens=10)
lm += guidance.gen(max_tokens=10)
print(lm)

With `select`, the first key generated is "title" ("wrong" in the sense that it ignores the system prompt's instruction to generate the author first), while with unconstrained `gen` the first key is "author" ("correct" in the same sense).

(1) gen output:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a book information generator. Generate a JSON structure representing a book with the following properties: title, author, and publication date. Generate the author property first.<|eot_id|><|start_header_id|>user<|end_header_id|>

The book is called 'The Great Gatsby', written by F. Scott Fitzgerald, and was published in 1925.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{
  "author": "F. Scott Fitzgerald",

(2) select output:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a book information generator. Generate a JSON structure representing a book with the following properties: title, author, and publication date. Generate the author property first.<|eot_id|><|start_header_id|>user<|end_header_id|>

The book is called 'The Great Gatsby', written by F. Scott Fitzgerald, and was published in 1925.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{
  "title": "The Great Gatsby",
   "author

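To make the tokenization hypothesis concrete, here is a minimal sketch of how a token boundary can shift when a prompt and its continuation are tokenized separately rather than as one string. Everything in it is an assumption for illustration: `TOY_VOCAB` and `greedy_tokenize` are hypothetical, the vocabulary is not Llama 3's, and real tokenizers use BPE merges rather than greedy longest-match. It only shows the kind of mismatch that token healing is meant to address, not what guidance actually does internally.

```python
# Toy illustration only: a hypothetical vocabulary and a greedy
# longest-match tokenizer, NOT the Llama 3 tokenizer or guidance internals.

def greedy_tokenize(text, vocab):
    """Split text into tokens, always taking the longest vocab entry that matches."""
    tokens = []
    i = 0
    max_len = max(len(t) for t in vocab)
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if piece in vocab:
                tokens.append(piece)
                i += n
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return tokens


# Hypothetical vocabulary containing a token that spans the
# whitespace/quote boundary (two spaces followed by a double quote).
TOY_VOCAB = {"{", "\n", " ", "  ", '  "', '"', "author", "title"}

prompt = "{\n  "            # same trailing text as in the reproduction
continuation = '"author"'

# Tokenizing the two pieces separately forces a boundary after the spaces.
separate = greedy_tokenize(prompt, TOY_VOCAB) + greedy_tokenize(continuation, TOY_VOCAB)

# Tokenizing the full string lets the boundary-spanning token be used.
joint = greedy_tokenize(prompt + continuation, TOY_VOCAB)

print(separate)
print(joint)
```

In this toy setup the two token sequences decode to the same text but differ at the boundary, so a model scoring the forced sequence may be evaluating tokens it would rarely produce on its own. Whether something similar happens with the real tokenizer here would need to be checked by comparing `tokenizer.encode` on the prompt alone and on the prompt plus each `select` option.
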
System info (please complete the following information):

  • OS (e.g. Ubuntu, Windows 11, Mac OS, etc.): RHEL 9
  • Guidance Version (guidance.__version__): 0.1.15

wjn0, Jun 02 '24 23:06