AgentBench
[Bug/Assistance] - Reproducing Results on Alfworld (HH) (vs. ReAct paper)
Bug / Assistance Description The results reported in the HH column differ substantially from those in the ReAct paper. In particular, ReAct reports
To Reproduce See screenshots below. Your results in the HH column indicate 16% success for text-davinci-002 or gpt-3.5-turbo. However, the results reported for ReAct using text-davinci-002 indicate 78% (second screenshot). This is a significant difference.
Screenshots or Terminal Copy&Paste
Concrete Questions / Actions: Please tell us:
- How does your evaluation for Alfworld (HH) differ from ReAct's?
- Which exact model did you use?
- Which prompts did you use (1-shot, 2-shot), and are they the same as those in the ReAct paper?
- Why are the results so different?