Falcon Compatibility Bug
How do I need to initialize the falcon_40b model in order to work with guidance?
My approach is as follows:
from transformers import AutoTokenizer
import guidance
import torch
model = "tiiuae/falcon-40b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model)
guidance.llm = guidance.llms.Transformers(
    model=model,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
    tokenizer=tokenizer,
    eos_token_id=tokenizer.eos_token_id,
)
# we can pre-define valid option sets
valid_weapons = ["sword", "axe", "mace", "spear", "bow", "crossbow"]
# define the prompt
character_maker = guidance("""The following is a character profile for an RPG game in JSON format.
```json
{
"id": "{{id}}",
"description": "{{description}}",
"name": "{{gen 'name'}}",
"age": {{gen 'age' pattern='[0-9]+' stop=','}},
"armor": "{{#select 'armor'}}leather{{or}}chainmail{{or}}plate{{/select}}",
"weapon": "{{select 'weapon' options=valid_weapons}}",
"class": "{{gen 'class' temperature=0.99}}",
"mantra": "{{gen 'mantra' temperature=0.7}}",
"strength": {{gen 'strength' pattern='[0-9]+' stop=','}},
"items": [{{#geneach 'items' num_iterations=5 join=', '}}"{{gen 'this' temperature=0.99}}"{{/geneach}}]
}```""")
# generate a character
character_maker(
    id="e1f491f7-7ab8-4dac-8c20-c92b5e7d883d",
    description="A quick and nimble fighter.",
    valid_weapons=valid_weapons,  # llm=llama
)
It creates this output (sometimes a few more tokens are generated): The following is a character profile for an RPG game in JSON format.
{
"id": "e1f491f7-7ab8-4dac-8c20-c92b5e7d883d",
"description": "A quick and nimble fighter.",
"name": "Rogue
Until it crashes with this error:
Exception in thread Thread-10:
Traceback (most recent call last):
File "XYZ/lib/python3.9/threading.py", line 973, in _bootstrap_inner
self.run()
File "XYZ/lib/python3.9/threading.py", line 910, in run
self._target(*self._args, **self._kwargs)
File "XYZ/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "XYZ/lib/python3.9/site-packages/transformers/generation/utils.py", line 1515, in generate
return self.greedy_search(
File "XYZ/lib/python3.9/site-packages/transformers/generation/utils.py", line 2332, in greedy_search
outputs = self(
File "XYZ/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "XYZ/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "XYZ/.cache/huggingface/modules/transformers_modules/tiiuae/falcon-40b-instruct/4e8f82c2d7468e3d9c88be4f38f531449141b52b/modelling_RW.py", line 759, in forward
transformer_outputs = self.transformer(
File "XYZ/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "XYZ/.cache/huggingface/modules/transformers_modules/tiiuae/falcon-40b-instruct/4e8f82c2d7468e3d9c88be4f38f531449141b52b/modelling_RW.py", line 620, in forward
causal_mask = self._prepare_attn_mask(
File "XYZ/.cache/huggingface/modules/transformers_modules/tiiuae/falcon-40b-instruct/4e8f82c2d7468e3d9c88be4f38f531449141b52b/modelling_RW.py", line 539, in _prepare_attn_mask
expanded_attn_mask if combined_attention_mask is None else expanded_attn_mask | combined_attention_mask
RuntimeError: The size of tensor a (80) must match the size of tensor b (69) at non-singleton dimension 3
On the other hand, standard generation via the recommended code works without a problem:
import transformers

pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)
sequences = pipeline(
    "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Giraftron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:",
    max_length=500,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")
Check out this issue: https://github.com/microsoft/guidance/issues/166. I haven't tried it yet, but the recommendation there is to remove device_map="auto" from the Transformers signature.
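For illustration, a rough, untested sketch of that suggestion, reusing the model and tokenizer from the snippet above; without a device map the model is loaded onto a single device, which needs enough memory for Falcon-40B at bfloat16:

import torch
from transformers import AutoTokenizer
import guidance

model = "tiiuae/falcon-40b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model)

guidance.llm = guidance.llms.Transformers(
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    # device_map="auto" removed, as suggested in issue 166
)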
"device_map=auto" is not a problem with other models from Huggingface for me. The bug is caused in the modelling_RW.py file which needs to be downloaded together with the model in order to run it (therefore the trust_remote_code=True).
Also I wouldn't know how to split the model on multiple GPUs without device mapping. Any hints? Im trying CPU-only atm (to test your hypothesis), but its extremely slow.
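One possible way to split the model over several GPUs without device_map="auto" would be to load it yourself with an explicit placement strategy and hand the loaded model to guidance. This is an untested sketch: the max_memory values are placeholders for your hardware, and it assumes guidance's Transformers wrapper accepts an already-instantiated model object rather than only a model name:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import guidance

model_name = "tiiuae/falcon-40b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Spread the weights over two GPUs with an explicit strategy instead of "auto";
# adjust the per-GPU memory limits to match your setup.
falcon = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="balanced",
    max_memory={0: "40GiB", 1: "40GiB"},
)

guidance.llm = guidance.llms.Transformers(model=falcon, tokenizer=tokenizer)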
The same problem was also reported in #165 (related to another, now-solved issue).
Running into this issue as well - funnily enough, I don't run into the tensor errors when using falcon 40b wrapped in a PEFT adapter, though inference is really slow.
Would you mind sharing how you load the model into guidance via PEFT?
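For reference, a minimal sketch of what such a PEFT setup might look like; it is untested, the adapter path is a placeholder, and it again assumes guidance accepts a pre-loaded model object:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import guidance

base_name = "tiiuae/falcon-40b-instruct"
adapter_path = "path/to/peft-adapter"  # placeholder: your fine-tuned adapter

tokenizer = AutoTokenizer.from_pretrained(base_name)
base_model = AutoModelForCausalLM.from_pretrained(
    base_name,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
# Wrap the base model with the PEFT adapter, then pass the wrapped model to guidance
model = PeftModel.from_pretrained(base_model, adapter_path)

guidance.llm = guidance.llms.Transformers(model=model, tokenizer=tokenizer)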
I'm confirming this is still an issue. I've updated transformers, re-downloaded Falcon, tried various configurations suggested by community members to remove Falcon's broken caching mechanism, and more.
Still a bug! Is guidance kinda dead then? Any recommendations for other libs, besides lmql?
How to fix it..(。•ˇ‸ˇ•。)…
@kk19990709 Falcon is buggy, and it's a weak model by today's standards. Try out Mistral, Zephyr, or OpenChat for better quality from a much smaller model, or Qwen-14B.
Also, the maintainers of this lib have been mostly incommunicado for weeks to months. Check out lmql.