AgentBench [Bug/Assistance] - Reproducing Results on Alfworld (HH) (vs. ReAct paper)

Open ai-nikolai opened this issue 11 months ago • 4 comments

Bug / Assistance Description: The results reported in the HH column differ substantially from those in the ReAct paper. In particular, ReAct reports

To Reproduce: See the screenshots below. Your results in the HH column indicate 16% success for text-davinci-002 or gpt-3.5-turbo. However, the results using text-davinci-002 in the ReAct paper indicate 78% (second screenshot). This is a significant difference.

Screenshots or Terminal Copy&Paste: [screenshot: AgentBench]

Concrete Questions / Actions: Please tell us:

How your evaluation for Alfworld (HH) differs from ReAct?
Which exact model you used?
Which prompts you used (1-shot, 2-shot), and are they the same as from the ReAct paper?
Why are the results so different?

Mar 09 '24 14:03 ai-nikolai
