`select` produces different results than `gen`, even though the maximum likelihood answer should be the same (tokenization/token healing issue?)
The bug
I have a minimal reproducible example where I would expect `select` and `gen` to produce similar results, but they don't. My experiments suggest a tokenization or token healing issue, but I'm not sure. If the behaviour is expected, some documentation explaining why would be useful.
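To illustrate the kind of token-boundary effect I suspect, here is a small sketch with a toy greedy longest-match tokenizer. The vocabulary (`TOY_VOCAB`) and the `greedy_tokenize` helper are hypothetical and are not Llama 3's real tokenizer or anything from guidance; the point is only that tokenizing the prompt and the option separately can give a different token sequence than tokenizing their concatenation.

```python
def greedy_tokenize(text, vocab):
    """Greedy longest-match tokenizer over a toy vocabulary (not a real BPE)."""
    tokens = []
    i = 0
    while i < len(text):
        # Pick the longest vocabulary entry that matches at position i.
        match = max((v for v in vocab if text.startswith(v, i)), key=len, default=None)
        if match is None:
            raise ValueError(f"no token matches at position {i}")
        tokens.append(match)
        i += len(match)
    return tokens


# Hypothetical vocabulary: note the merged ' "' (space + quote) token.
TOY_VOCAB = ["{", "\n", " ", '"', ' "', "author", "title"]

prompt = "{\n "
option = '"author"'

# Tokenize prompt and option independently, then concatenate the tokens.
separate = greedy_tokenize(prompt, TOY_VOCAB) + greedy_tokenize(option, TOY_VOCAB)
# Tokenize the concatenated string in one pass.
joint = greedy_tokenize(prompt + option, TOY_VOCAB)

print(separate)
print(joint)
```

Both sequences decode to the same string, but the trailing space of the prompt ends up in a different token in each case. If the model mostly saw the merged form during training, the two paths could be scored differently, which is my guess at what is happening here.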
To Reproduce
from transformers import AutoModelForCausalLM, AutoTokenizer
import guidance
print("guidance version: ", guidance.__version__)
model_name = "unsloth/llama-3-70b-Instruct-bnb-4bit"
device = "cuda"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
lm = guidance.models.Transformers(model=model, tokenizer=tokenizer)
messages = [
    {"role": "system",
     "content": "You are a book information generator. Generate a JSON structure representing a book with the following properties: title, author, and publication date. Generate the author property first."},
    {"role": "user",
     "content": "The book is called 'The Great Gatsby', written by F. Scott Fitzgerald, and was published in 1925."},
]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
lm += prompt
lm += "{\n "
# lm += guidance.select(["\"author\"", "\"title\""]) + guidance.gen(max_tokens=10)
lm += guidance.gen(max_tokens=10)
print(lm)
With `select`, the next output is `"title"` ("wrong" in a certain sense), while with unconstrained generation the output is `"author"` ("correct" in a certain sense).
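To make "maximum likelihood answer" concrete, this is roughly the comparison I have in mind: score each option by the sum of its token log-probabilities given the prompt and pick the highest. The sketch below uses a made-up next-token distribution (`toy_next_token_probs`) in place of the real model, and the helper name `sequence_logprob` is my own; none of this is guidance's implementation.

```python
import math


def sequence_logprob(next_token_probs, prefix, continuation):
    """Sum log P(token | context) over the continuation tokens."""
    context = list(prefix)
    total = 0.0
    for token in continuation:
        total += math.log(next_token_probs(tuple(context))[token])
        context.append(token)
    return total


def toy_next_token_probs(context):
    """Made-up distribution standing in for a real model's softmax output."""
    if context[-1] == ' "':
        # Illustrative numbers only, not measured from any model.
        return {"author": 0.7, "title": 0.3}
    # After a key name, the closing quote is certain in this toy.
    return {'"': 1.0}


prefix = ["{", "\n", ' "']
candidates = {"author": ["author", '"'], "title": ["title", '"']}

scores = {
    name: sequence_logprob(toy_next_token_probs, prefix, tokens)
    for name, tokens in candidates.items()
}
best = max(scores, key=scores.get)
print(scores, best)
```

With a real model, `next_token_probs` would wrap a forward pass, and the result would depend on how the prompt and options are tokenized at the boundary.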
(1) `gen` output:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a book information generator. Generate a JSON structure representing a book with the following properties: title, author, and publication date. Generate the author property first.<|eot_id|><|start_header_id|>user<|end_header_id|>
The book is called 'The Great Gatsby', written by F. Scott Fitzgerald, and was published in 1925.<|eot_id|><|start_header_id|>assistant<|end_header_id|>
{
"author": "F. Scott Fitzgerald",
(2) `select` output:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a book information generator. Generate a JSON structure representing a book with the following properties: title, author, and publication date. Generate the author property first.<|eot_id|><|start_header_id|>user<|end_header_id|>
The book is called 'The Great Gatsby', written by F. Scott Fitzgerald, and was published in 1925.<|eot_id|><|start_header_id|>assistant<|end_header_id|>
{
"title": "The Great Gatsby",
"author
System info (please complete the following information):
- OS (e.g. Ubuntu, Windows 11, Mac OS, etc.): RHEL 9
- Guidance Version (`guidance.__version__`): 0.1.15