visualwebarena
Documentation, WebArena 2.0, Evaluation Cache
Documentation
Update README files for compatibility with both WebArena (WA) and VisualWebArena (VWA).
WebArena 2.0
WebArena 2.0 addresses annotation issues reported by various users. Specifically:
- WebArena 2.0 minimizes the use of `exact_match` and `must_include` for information-seeking tasks with `StringEvaluator`. The migration from old evaluators to new ones generally follows these rules:
  - `exact_match` -> `fuzzy_exact_match`
  - `must_include`, `fuzzy_match`:
    - If the list contains 1 item -> `fuzzy_exact_match`
    - If the list contains > 1 item:
      - For elements on the same level and the same topic -> `fuzzy_must_include`
      - For elements covering different aspects -> `context_qa`
  - `na` -> `fuzzy_na_match`, which explicitly evaluates the reasoning behind unachievable outcomes.
  - Reddit post-related tasks -> `qa`.
- `context_qa` evaluates content based on both intent and answer.
- `qa` evaluates based only on the answer, as the intent is not relevant.
The prompts are tested in `evaluation_harness/eval_evaluators`.
- Other fixes
**Fixes from GitHub issues**
https://github.com/web-arena-x/webarena/issues/100
2: product type is too vague; removed
3: update the intent to indicate tied rank
4: update the intent to indicate tied rank
5: type is too vague; add the scope
https://github.com/web-arena-x/webarena/issues/135
45: update the intent to be more accurate
https://github.com/web-arena-x/webarena/issues/137
425: update the intent to be more accurate
**Individual fixes**
Template 324: remove the ranking requirement.
Template 204: use a combination of `context_qa` and `must_include`.
Templates 792 and 793 were deleted because the reasoning behind them was not sound.
Fix errors found by THU group [THU-Webarena-lite Bug Fixing](https://docs.google.com/spreadsheets/d/13BRuRlU_Z_UBcucjQ5myvrRdB0P0ID3Nj-dWlzawuYo/edit#gid=1021875443)
**Typo, grammar**
by far -> so far
https://github.com/web-arena-x/webarena/issues/133
correpong -> corresponding
telll -> tell
canlled -> cancelled
what could -> how could
competative -> competitive
Evaluation Cache
Support a result cache so that evaluation can be run offline. This will be helpful if we accept submissions in the future: participants only need to upload their cached files, and we can perform evaluation quickly without rerunning their models.
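A minimal sketch of such a cache, assuming one JSON file per task keyed by task ID. The function names (`cache_result`, `load_cached_answer`) and the file layout are illustrative assumptions, not the repository's actual cache format.

```python
import json
from pathlib import Path


def cache_result(cache_dir: Path, task_id: str, answer: str) -> Path:
    """Persist a model's final answer for one task (hypothetical format:
    one JSON file per task), so evaluation can be rerun offline later."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{task_id}.json"
    path.write_text(json.dumps({"task_id": task_id, "answer": answer}))
    return path


def load_cached_answer(cache_dir: Path, task_id: str) -> str:
    """Read a cached answer back for offline scoring, without the model."""
    return json.loads((cache_dir / f"{task_id}.json").read_text())["answer"]
```

The evaluator then scores `load_cached_answer(...)` against the task's reference instead of invoking the participant's model.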