FollowBench
Code for "FollowBench: A Multi-level Fine-grained Constraints Following Benchmark for Large Language Models (ACL 2024)"
Thank you for proposing this interesting benchmark. After finishing the **Model Inference** and **LLM-based Evaluation**, we tried to obtain the results as shown in **Merge Evaluation and Save Results**. However,...
Awesome work! Will you release the code that creates your eval dataset? It seems quite complicated from the description in the paper. Using this method to generate more data for...
Hello, may I ask a question? Why does the example not have a prompt? Is it only evaluated with rule-based evaluation?
Hello, I have a question: After executing model_inference.py and getting the results, do I need to run inference with my own model on all the questions before executing llm_eval.py? What...
I just ran the code below and found that the examples that need to be evaluated by the LLM are not equivalent to those in your paper. `rule_based_source = ["E2E", "WIKIEVENTS", "CONLL2003", "text_editing", "cnn_dailymail", "xsum",...
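For readers trying to reproduce this split, here is a minimal sketch (not the repository's actual code) of how a source list like the one quoted above could be used to partition examples between rule-based and LLM-based evaluation; the `source` field name and the helper function are assumptions for illustration only.

```python
# Hypothetical illustration: partition examples by whether their data source
# is evaluated with rules or with an LLM. The "source" key and the function
# name are assumptions, not the schema used by FollowBench itself.

rule_based_source = ["E2E", "WIKIEVENTS", "CONLL2003", "text_editing",
                     "cnn_dailymail", "xsum"]  # list truncated in the issue above


def split_examples(examples):
    """Return (rule_based, llm_based) subsets of the given examples."""
    rule_based, llm_based = [], []
    for ex in examples:
        if ex.get("source") in rule_based_source:
            rule_based.append(ex)
        else:
            llm_based.append(ex)
    return rule_based, llm_based


# Example usage: count how many examples would be sent to the LLM evaluator.
if __name__ == "__main__":
    examples = [{"source": "E2E"}, {"source": "open_ended"}]
    rule_based, llm_based = split_examples(examples)
    print(len(rule_based), "rule-based;", len(llm_based), "LLM-evaluated")
```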
In addition to the **'example'** constraint, is only GPT-4 used for evaluating the other constraint types, or are the other constraint types scored with both rule-based and GPT-based evaluation (double scoring)?