
ToolQA is a new dataset for evaluating the capabilities of LLMs in answering challenging questions with external tools. It offers two difficulty levels (easy/hard) across eight real-life scenarios.

Results: 5 ToolQA issues

Hello, we are trying to replicate your work, but we haven't obtained the results reported in your paper on the GSM8k easy dataset. Upon reviewing the output files, we noticed that...

Thanks for your great work! Is the Python code in **Programmatic Answer Generation** generated by an LLM or written by a human?

Could you kindly provide the raw data of the coffee dataset? I clicked the link you provided, but it shows that the page doesn't exist. ![Screenshot 2024-08-02 at 20 41 27](https://github.com/user-attachments/assets/189887b2-a2f4-4277-bdae-6c0c757bf010)

I am running the coffee-hard benchmark, and I noticed a problem with tabletool.py. Questions of this kind: "How much did the coffee price change from 2017-09-11 to 2018-04-03?"...
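
For context, here is a minimal sketch of the computation a question like the one quoted above seems to ask for: the price on the end date minus the price on the start date. It is not the code from tabletool.py. The `prices` table, its values, and the `price_change` helper are hypothetical placeholders, and treating "price" as a single value per day is an assumption.

```python
from datetime import date

# Hypothetical daily prices keyed by date.
# The values are placeholders for illustration, not real coffee data.
prices = {
    date(2017, 9, 11): 100.0,
    date(2018, 4, 3): 90.0,
}


def price_change(table, start, end):
    """Return the price on `end` minus the price on `start`."""
    # Fail loudly if either date has no row, rather than guessing a nearby day.
    missing = [d for d in (start, end) if d not in table]
    if missing:
        raise KeyError(f"no price recorded for: {missing}")
    return table[end] - table[start]


change = price_change(prices, date(2017, 9, 11), date(2018, 4, 3))
print(change)
```

Raising an error on a missing date is a deliberate choice in this sketch: silently falling back to the nearest trading day would change the answer without telling the caller.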

Here is the SQL for the first 10 questions about flight delays. I use the DOT definition, where a flight is considered delayed if it is more than 15 after...