ToolQA issues

About the evaluation

Hello, we are trying to replicate your work, but we have not obtained the results reported in your paper on the GSM8k easy dataset. Upon reviewing the output files, we noticed that...

zhangzhen-research

About Programmatic Answer Generation Part

Thanks for your great work! Is the Python code in **Programmatic Answer Generation** generated by an LLM or written by a human?

yc1999

Coffee dataset

Could you kindly provide the raw data of the coffee dataset? I clicked the link you provided, but it shows that the page does not exist. ![Screenshot 2024-08-02 at 20 41 27](https://github.com/user-attachments/assets/189887b2-a2f4-4277-bdae-6c0c757bf010)

xschen-beb

problems of tabletools.py

I am running the coffee-hard benchmark, and I noticed a problem with tabletools.py. It affects questions of this kind: "How much did the coffee price change from 2017-09-11 to 2018-04-03?"...

iolingl

Answers for flight hard dataset appear to be incorrect

Here is the SQL for the first 10 questions about flight delays. I use the DOT definition where a flight is considered delayed if it is more than 15 after...

dean-stanford
