webarena Overuse of exact matches in eval.

Open afourney opened this issue 3 months ago • 7 comments

I would like to re-open issue #104.

There's an overuse of exact matches in the eval harness. For example, consider task 649:

 "intent": "Post in history subreddit about what could diffusion model help the correpong field.",
...
          "required_contents": {
            "must_include": [
              "diffusion model",
              "help"
            ]
          }

Our agent created the following post, but failed the test because it used the word "benefit" rather than the word "help":

This is overly rigid, and it is not an outlier. We should move to LLM matching, or make the intent more explicit.
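To make the failure mode concrete, here is a minimal sketch. It assumes the `must_include` check behaves like a case-insensitive substring test, which may not match the harness exactly. The function names `must_include_all` and `must_include_any_of` and the sample post text are hypothetical and are not taken from the WebArena code or from the agent's actual output.

```python
def must_include_all(text: str, phrases: list[str]) -> bool:
    """Hypothetical stand-in for an exact-match check.

    Every phrase must appear verbatim (case-insensitive) in the text.
    """
    lowered = text.lower()
    return all(p.lower() in lowered for p in phrases)


def must_include_any_of(text: str, groups: list[list[str]]) -> bool:
    """Looser variant: each group lists acceptable alternatives.

    The text passes if at least one alternative from every group appears.
    """
    lowered = text.lower()
    return all(any(alt.lower() in lowered for alt in group) for group in groups)


# Made-up post text for illustration only (not the agent's actual post).
post = "How could a diffusion model benefit the field of history?"

# Strict check mirrors the task's must_include list and rejects the post.
strict = must_include_all(post, ["diffusion model", "help"])

# Lenient check accepts "benefit" as an alternative to "help".
lenient = must_include_any_of(post, [["diffusion model"], ["help", "benefit"]])

print(strict, lenient)
```

Listing alternatives per required phrase is only one possible relaxation. It still relies on someone enumerating synonyms by hand, which is why LLM matching or a more explicit intent may be the better fix.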

Apr 24 '24 05:04 afourney
