Falcon Compatibility Bug
How do I need to initialize the falcon_40b model in order to work with guidance?
My approach is as follows:
from transformers import AutoTokenizer
import guidance
import torch
model = "tiiuae/falcon-40b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model)
guidance.llm = guidance.llms.Transformers(
    model=model,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
    tokenizer=tokenizer,
    eos_token_id=tokenizer.eos_token_id,
)
# we can pre-define valid option sets
valid_weapons = ["sword", "axe", "mace", "spear", "bow", "crossbow"]
# define the prompt
character_maker = guidance("""The following is a character profile for an RPG game in JSON format.
```json
{
"id": "{{id}}",
"description": "{{description}}",
"name": "{{gen 'name'}}",
"age": {{gen 'age' pattern='[0-9]+' stop=','}},
"armor": "{{#select 'armor'}}leather{{or}}chainmail{{or}}plate{{/select}}",
"weapon": "{{select 'weapon' options=valid_weapons}}",
"class": "{{gen 'class' temperature=0.99}}",
"mantra": "{{gen 'mantra' temperature=0.7}}",
"strength": {{gen 'strength' pattern='[0-9]+' stop=','}},
"items": [{{#geneach 'items' num_iterations=5 join=', '}}"{{gen 'this' temperature=0.99}}"{{/geneach}}]
}```""")
# generate a character
character_maker(
    id="e1f491f7-7ab8-4dac-8c20-c92b5e7d883d",
    description="A quick and nimble fighter.",
    valid_weapons=valid_weapons,  # llm=llama
)
It creates this output (sometimes a few more tokens are generated): The following is a character profile for an RPG game in JSON format.
{
"id": "e1f491f7-7ab8-4dac-8c20-c92b5e7d883d",
"description": "A quick and nimble fighter.",
"name": "Rogue
Until it crashes with this error:
Exception in thread Thread-10:
Traceback (most recent call last):
File "XYZ/lib/python3.9/threading.py", line 973, in _bootstrap_inner
self.run()
File "XYZ/lib/python3.9/threading.py", line 910, in run
self._target(*self._args, **self._kwargs)
File "XYZ/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "XYZ/lib/python3.9/site-packages/transformers/generation/utils.py", line 1515, in generate
return self.greedy_search(
File "XYZ/lib/python3.9/site-packages/transformers/generation/utils.py", line 2332, in greedy_search
outputs = self(
File "XYZ/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "XYZ/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "XYZ/.cache/huggingface/modules/transformers_modules/tiiuae/falcon-40b-instruct/4e8f82c2d7468e3d9c88be4f38f531449141b52b/modelling_RW.py", line 759, in forward
transformer_outputs = self.transformer(
File "XYZ/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "XYZ/.cache/huggingface/modules/transformers_modules/tiiuae/falcon-40b-instruct/4e8f82c2d7468e3d9c88be4f38f531449141b52b/modelling_RW.py", line 620, in forward
causal_mask = self._prepare_attn_mask(
File "XYZ/.cache/huggingface/modules/transformers_modules/tiiuae/falcon-40b-instruct/4e8f82c2d7468e3d9c88be4f38f531449141b52b/modelling_RW.py", line 539, in _prepare_attn_mask
expanded_attn_mask if combined_attention_mask is None else expanded_attn_mask | combined_attention_mask
RuntimeError: The size of tensor a (80) must match the size of tensor b (69) at non-singleton dimension 3
On the other hand, standard generation via the recommended code works without a problem:
import transformers

pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)
sequences = pipeline(
    "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Giraftron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:",
    max_length=500,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")
Check out this issue: https://github.com/microsoft/guidance/issues/166. I haven't tried it yet, but the recommendation there is to remove device_map="auto" from the Transformers signature.
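For illustration, a rough, untested sketch of that suggestion, reusing the model and tokenizer from the snippet above; without a device map the model is loaded onto a single device, which needs enough memory for Falcon-40B at bfloat16:

import torch
from transformers import AutoTokenizer
import guidance

model = "tiiuae/falcon-40b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model)

guidance.llm = guidance.llms.Transformers(
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    # device_map="auto" removed, as suggested in issue 166
)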
"device_map=auto" is not a problem with other models from Huggingface for me. The bug is caused in the modelling_RW.py file which needs to be downloaded together with the model in order to run it (therefore the trust_remote_code=True).
Also I wouldn't know how to split the model on multiple GPUs without device mapping. Any hints? Im trying CPU-only atm (to test your hypothesis), but its extremely slow.
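One possible way to split the model over several GPUs without device_map="auto" would be to load it yourself with an explicit placement strategy and hand the loaded model to guidance. This is an untested sketch: the max_memory values are placeholders for your hardware, and it assumes guidance's Transformers wrapper accepts an already-instantiated model object rather than only a model name:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import guidance

model_name = "tiiuae/falcon-40b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Spread the weights over two GPUs with an explicit strategy instead of "auto";
# adjust the per-GPU memory limits to match your setup.
falcon = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="balanced",
    max_memory={0: "40GiB", 1: "40GiB"},
)

guidance.llm = guidance.llms.Transformers(model=falcon, tokenizer=tokenizer)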
The same problem was also reported in #165 (related to another, now-solved issue).
Running into this issue as well - funnily enough, I don't run into the tensor errors when using falcon 40b wrapped in a PEFT adapter, though inference is really slow.
Would you mind sharing how you load the model into guidance via PEFT?
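For reference, a minimal sketch of what such a PEFT setup might look like; it is untested, the adapter path is a placeholder, and it again assumes guidance accepts a pre-loaded model object:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import guidance

base_name = "tiiuae/falcon-40b-instruct"
adapter_path = "path/to/peft-adapter"  # placeholder: your fine-tuned adapter

tokenizer = AutoTokenizer.from_pretrained(base_name)
base_model = AutoModelForCausalLM.from_pretrained(
    base_name,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
# Wrap the base model with the PEFT adapter, then pass the wrapped model to guidance
model = PeftModel.from_pretrained(base_model, adapter_path)

guidance.llm = guidance.llms.Transformers(model=model, tokenizer=tokenizer)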
I'm confirming this is still an issue. I've updated transformers, re-downloaded Falcon, tried various configurations suggested by community members to remove Falcon's broken caching mechanism, and more.
Still a bug! Is guidance kinda dead then? Any recommendations for other libs, besides lmql?
How to fix it..(。•ˇ‸ˇ•。)…
@kk19990709 Falcon is buggy, and it's a weak model by today's standards. Try out Mistral, Zephyr, or OpenChat for better quality from a much smaller model, or Qwen-14B.
Also, the maintainers of this lib have been mostly incommunicado for weeks to months. Check out lmql.