FastChat
[Chatbot Arena] Add Falcon 40B model
Abu Dhabi's Technology Innovation Institute (TII) just released new 7B and 40B LLMs. The Falcon-40B model is now at the top of the Open LLM Leaderboard, beating llama-30b-supercot and llama-65b among others.
Therefore, I would love to see the Falcon 40B model added to the Chatbot Arena and its Leaderboard!
| Model | Revision | Average | ARC (25-shot) | HellaSwag (10-shot) | MMLU (5-shot) | TruthfulQA (0-shot) |
|---|---|---|---|---|---|---|
| tiiuae/falcon-40b | main | 60.4 | 61.9 | 85.3 | 52.7 | 41.7 |
| ausboss/llama-30b-supercot | main | 59.8 | 58.5 | 82.9 | 44.3 | 53.6 |
| llama-65b | main | 58.3 | 57.8 | 84.2 | 48.8 | 42.3 |
| MetaIX/GPT4-X-Alpasta-30b | main | 57.9 | 56.7 | 81.4 | 43.6 | 49.7 |
Press release: UAE's Technology Innovation Institute Launches Open-Source "Falcon 40B" Large Language Model for Research & Commercial Utilization
The Technology Innovation Institute (TII) in Abu Dhabi has announced its open-source large language model (LLM), the Falcon 40B. With 40 billion parameters, Falcon 40B is the UAE's first large-scale AI model, indicating the country's ambition in the field of AI and its commitment to promote innovation and research.
Unlike most LLMs, which are typically available only for non-commercial use, Falcon 40B is open to both research and commercial usage. TII has also included the model's weights in the open-source package, which will enhance the model's capabilities and allow for more effective fine-tuning.
In addition to the launch of Falcon 40B, the TII has initiated a call for proposals from researchers and visionaries interested in leveraging the model to create innovative use cases or explore further applications. As a reward for exceptional research proposals, selected projects will receive "training compute power" as an investment, allowing for more robust data analysis and complex modeling. VentureOne, the commercialization arm of ATRC, will provide computational resources for the most promising projects.
TII's Falcon 40B has shown impressive performance since its unveiling in March 2023. When benchmarked using Stanford University's HELM LLM tool, it used less training compute than other renowned LLMs such as OpenAI's GPT-3, DeepMind's Chinchilla AI, and Google's PaLM-62B.
Those interested in accessing Falcon 40B or proposing use cases can do so through the FalconLLM.TII.ae website. Falcon LLMs open-sourced to date are available under a license built upon the principles of the open-source Apache 2.0 software, permitting a broad range of free use.
Hugging Face links
So I did spend some time adding and running Falcon-7B-Instruct locally, but the streamed output looks like the model is hallucinating. On the other hand, when I use `model.generate` for the same input, the output is correct. As I am new to the codebase, I'm not completely familiar with why the two outputs differ.

`model.generate` output:

streamed output:
This is how I initialize the conversation object:
register_conv_template(
    Conversation(
        name="falcon",
        system="""The conversation between human and AI assistant.""",
        roles=("[|Human|]", "[|AI|]"),
        messages=(),
        offset=0,
        sep_style=SeparatorStyle.NO_COLON_SINGLE,
        sep="\n",
        stop_str=["\n"],
        stop_token_ids=[193],
    )
)
I am not sure why this is happening. Has this got anything to do with how the outputs are handled and fed back to the model for streaming? Any suggestions?
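One way to localize this kind of divergence is to check that the streaming loop feeds back exactly the same context the one-shot path sees. Below is a minimal sketch with a deterministic stand-in "model" (not the real FastChat code); if the two paths disagree on a stand-in like this, the bug is in the loop's bookkeeping rather than in the model itself.

```python
def fake_forward(tokens):
    # Deterministic stand-in for a language model: the "next token" is a
    # fixed function of the full context seen so far.
    return (sum(tokens) * 31 + len(tokens)) % 100


def generate_once(prompt, steps):
    # "model.generate"-style path: one loop that always extends the full sequence.
    seq = list(prompt)
    for _ in range(steps):
        seq.append(fake_forward(seq))
    return seq


def generate_stream(prompt, steps):
    # Streaming path: yields tokens one by one; it must feed back exactly
    # the same context as the one-shot path, or the outputs diverge.
    seq = list(prompt)
    for _ in range(steps):
        tok = fake_forward(seq)
        seq.append(tok)
        yield tok


prompt = [5, 9, 2]
assert generate_once(prompt, 8)[len(prompt):] == list(generate_stream(prompt, 8))
```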
@OAfzal could the random output be because fastchat is loading the model with dtype `float16` rather than `bfloat16`?
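For context on why the dtype could matter: float16 and bfloat16 allocate their 16 bits differently, and bfloat16 keeps float32's 8-bit exponent. A model trained in bfloat16 can produce activations that overflow float16's much smaller range, yielding NaNs or garbage output. A quick illustration of the finite ranges, computed directly from the formats' bit layouts:

```python
# float16:  1 sign, 5 exponent, 10 mantissa bits -> max (2 - 2**-10) * 2**15
# bfloat16: 1 sign, 8 exponent,  7 mantissa bits -> max (2 - 2**-7)  * 2**127
fp16_max = (2 - 2**-10) * 2**15
bf16_max = (2 - 2**-7) * 2**127

print(fp16_max)  # 65504.0
print(bf16_max)  # about 3.39e38, same order as float32's max
```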
@timesler So loading the model in `float16` and running a forward pass results in the following error:
I did not spend too much time debugging this, as I was able to load and run the model with `bfloat16`. To confirm, I am loading the model in `bfloat16` for the above results. I also tried loading it with `float32`, but that did not help either. I have tried the exact same configuration in a notebook with `.generate` and it gives the correct output.
I tried using `model.generate` inside the `chatloop` function and the results are correct. So that confirms that the issue is with `generate_stream_func`. I will try and inspect it further.
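For reference, here is a hypothetical sketch of the stop handling a streaming loop needs to perform on each new token. The names `stop_strs`/`stop_token_ids` mirror the template fields discussed above; this is an illustration of the technique, not the actual `generate_stream_func` implementation.

```python
def check_stop(output: str, token_id: int, stop_strs, stop_token_ids):
    """Return (finished, truncated_output) for the text decoded so far."""
    # Stop immediately on a designated stop token (e.g. Falcon's 193).
    if token_id in stop_token_ids:
        return True, output
    # Otherwise scan the accumulated text for a stop string and cut it
    # (plus anything after it) from what is shown to the user.
    for s in stop_strs:
        pos = output.find(s)
        if pos != -1:
            return True, output[:pos]
    return False, output


finished, text = check_stop("Hi!\n[|Human|] more", 42, ["[|Human|]"], [193])
assert finished and text == "Hi!\n"
```

A subtle point this sketch glosses over: a stop string can arrive split across several streamed tokens, so real loops usually hold back the last few characters before displaying them.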
Weird output then.
register_conv_template(
    Conversation(
        name="falcon",
        system="""The following is a conversation between a human and an AI assistant named Falcon. The human and the AI assistant take turns chatting. Human statements start with [|Human|] and AI assistant statements start with [|AI|]. The AI assistant always provides responses in as much detail as possible, and in Markdown format. The AI assistant always declines to engage with topics, questions and instructions related to unethical, controversial, or sensitive issues. Complete the transcript in exactly that format.\n""",
        roles=("[|Human|]", "[|AI|]"),
        messages=(
            ("[|Human|]", "Hello!"),
            ("[|AI|]", "Hi!"),
        ),
        offset=2,
        sep_style=SeparatorStyle.NO_COLON_SINGLE,
        sep="\n",
        stop_str="[|Human|]",
        stop_token_ids=[193],
    )
)
Hey @Trangle, would you mind providing more context for the code you posted? I did use your template too, but the results look no better.
@OAfzal I just saw that official support has been added for Falcon in https://github.com/huggingface/text-generation-inference, so you may be able to glean some insight there about how to get streaming working
@timesler Ohh that sounds great! I will look that up.
Double thumbs up on this one now that Falcon is fully open source (Apache 2.0). We should aim to focus all our efforts in that direction going forward where possible.
Is anybody working on this? I'd love to try adding Falcon into this library.
> Hey @Trangle, would you mind providing more context for the code you posted? I did use your template too, but the results look no better.
How about this one?
Conversation(
    name="falcon",
    system="",
    roles=("User", "Assistant"),
    messages=[],
    offset=0,
    sep_style=SeparatorStyle.RWKV,
    sep="\n",
    sep2="<|endoftext|>",
    stop_str="\nUser",  # stop generation at stop_str; it is also removed from the generated text
    stop_token_ids=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],  # better to put only special tokens here, since the tokenizer only strips special tokens
    # stop_token_ids=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 5584, 7932, 32250],
)
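To make the template above concrete, here is a rough illustration of how such a message list might render into a prompt string, with the final assistant turn left open for the model to complete. This is a simplified stand-in, not FastChat's actual `get_prompt` logic for `SeparatorStyle.RWKV`.

```python
def render_prompt(system, messages, sep):
    # Each turn becomes "Role: text"; a turn with no message yet is left
    # as a bare "Role:" so the model continues from there.
    parts = [system] if system else []
    for role, msg in messages:
        parts.append(f"{role}: {msg}" if msg else f"{role}:")
    return sep.join(parts)


msgs = [("User", "Hello!"), ("Assistant", None)]
print(render_prompt("", msgs, "\n"))
# User: Hello!
# Assistant:
```

Generation would then run until the `"\nUser"` stop string appears, i.e. until the model starts writing the human's next turn.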
> I tried using `model.generate` inside the `chatloop` function and the results are correct. So that confirms that the issue is with `generate_stream_func`. I will try and inspect it further.
I found the same thing as you did. Have you tried to fix it? Or have you already made a pull request that is waiting to be reviewed?
Best
Hi, could anyone here who has successfully run Falcon submit a pull request?
I found that Falcon does not seem to support `past_key_values` in calls like:
`model(input_ids=torch.as_tensor([[token]], device=device), use_cache=True, past_key_values=past_key_values)`
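To illustrate what `past_key_values` buys, and why a model that ignores it breaks the streaming loop, here is a toy stand-in where the "cache" is just a running summary of the prefix. Real models cache per-layer key/value tensors; the names here are illustrative only.

```python
def forward_full(tokens):
    # No-cache path: recompute the state over the whole sequence.
    state = (sum(tokens), len(tokens))
    return (state[0] * 7 + state[1]) % 50, state


def forward_cached(new_token, past):
    # Cached path: fold only the newest token into the stored state,
    # which is what passing past_key_values lets a real model do.
    s, n = past
    state = (s + new_token, n + 1)
    return (state[0] * 7 + state[1]) % 50, state


def decode(prompt, steps, use_cache):
    seq, past, out = list(prompt), None, []
    for _ in range(steps):
        if use_cache and past is not None:
            tok, past = forward_cached(seq[-1], past)
        else:
            tok, past = forward_full(seq)
        seq.append(tok)
        out.append(tok)
    return out


# Both paths must produce identical tokens; only the work per step differs.
assert decode([1, 2, 3], 6, use_cache=True) == decode([1, 2, 3], 6, use_cache=False)
```

If a model silently drops `past_key_values`, feeding it only the single newest token (as the cached path does) gives it almost no context, which would explain garbled streamed output.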
@ericzhou571 I seem to have fixed this issue: https://huggingface.co/tiiuae/falcon-40b/discussions/48#64807969bb25a636c9da2cd7
@Tron2016 Hi, is this your fixed version? You can find it at: https://huggingface.co/tiiuae/falcon-40b/discussions/48#6488434b7fe834f5890b69f8 I'm not sure where I should apply this code. Should it be added to the RWmodel file provided by the falcon weight package? I added support for falcon to fastchat in this PR: https://github.com/lm-sys/FastChat/pull/1696/files. However, I created a new file specifically for falcon inference. If your changes can be made in the fastchat code, maybe we can still use the fastchat default generate stream?
Yes, the rotary embedding also has a bug and needs to be fixed: https://huggingface.co/tiiuae/falcon-7b/discussions/17#64890b51ce7b9a2abe36b762. It should be added to the RWmodel file.
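For background on that fix: rotary embeddings (RoPE) rotate each feature pair by an angle proportional to the token's absolute position, so when decoding with a cache the new token's position must be offset by the cached prefix length. Reusing position 0 for every new token is exactly the kind of bug such a patch addresses. A minimal single-pair sketch (illustrative only, not Falcon's code; real RoPE uses a different frequency per feature pair):

```python
import math


def rope_2d(x, position, freq=1.0):
    # Rotate one (even, odd) feature pair by angle = position * freq.
    angle = position * freq
    c, s = math.cos(angle), math.sin(angle)
    return (x[0] * c - x[1] * s, x[0] * s + x[1] * c)


assert rope_2d((1.0, 2.0), 0) == (1.0, 2.0)  # position 0: identity rotation
assert rope_2d((1.0, 2.0), 5) != (1.0, 2.0)  # later positions rotate the pair
```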
Hi guys, I tried using the new Falcon changes from main, but it seems like `falcon_generate_stream` doesn't stop generating text when it should. I opened an issue on that: #1793
You can see there that the conversation I started with "Hello there" begins well, but the model keeps generating tokens past where it should stop.
> Hi guys, I tried using the new Falcon changes from main, but it seems like `falcon_generate_stream` doesn't stop generating text when it should. I opened an issue on that: #1793
Hi, can you provide more detail about that problem?
- Which kind of "doesn't stop generating" do you face? Repetition, or generating a whole conversation by itself?
- What is your model name? Does it have "falcon" in it?
- Which model do you use, falcon-xb or falcon-xb-instruct?
- Can you check which conversation template you use? Are you using the falcon template correctly?
Best
@ericzhou571 Hi :) Thanks for the quick reply:
- It seems like the model continues to generate new tokens even though it should have stopped. I attached an example in the issue (#1793) where the model started with "Hi there! How can I help you today?" but then continued talking to itself, i.e. "Hi there! How can I help you today?I'm looking ...".
- I'm using `tiiuae/falcon-7b-instruct`.
- Same as point 3: I'm using `tiiuae/falcon-7b-instruct`. The name should be derived from the path, as I understand 🧐
- I tried to debug it, and it looks like it uses the correct template (the falcon template from the main branch):
@dudulasry I made a mistake: I pushed a conversation template without a system message to the fastchat main repo. If you add a system message, everything works well.
What's the status of this? What needs to be done?
Falcon has been supported.