webarena
Overuse of exact matches in eval.
I would like to re-open issue #104.
There's an overuse of exact matches in the eval harness. For example, consider task 649:
"intent": "Post in history subreddit about what could diffusion model help the correpong field.",
...
"required_contents": {
    "must_include": [
        "diffusion model",
        "help"
    ]
}
Our agent created the following post, but failed the test because it used the word "benefit" rather than the word "help":
This is overly rigid, and task 649 is not an outlier. We should either move to LLM-based matching or make the intents more explicit about the exact wording they require.
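To make the failure mode concrete, here is a minimal sketch of how a strict `must_include` check behaves. It is not the actual harness code: the function name `must_include_pass` and both example posts are made up for illustration, and I am assuming the check is a case-insensitive substring test.

```python
def must_include_pass(text: str, required: list[str]) -> bool:
    """Hypothetical stand-in for a strict must_include check.

    Assumption: every required phrase must appear verbatim
    (case-insensitive substring) in the text.
    """
    lowered = text.lower()
    return all(phrase.lower() in lowered for phrase in required)


# Required phrases copied from task 649.
required = ["diffusion model", "help"]

# Invented posts, not our agent's real output.
post_with_benefit = "How could diffusion models benefit the field of history?"
post_with_help = "How could diffusion models help the field of history?"

# Same meaning, but only the second post contains the literal word "help".
result_benefit = must_include_pass(post_with_benefit, required)
result_help = must_include_pass(post_with_help, required)
print(result_benefit, result_help)
```

Under these assumptions the two posts differ by one synonym, yet only the one that happens to reuse the word "help" passes.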