visualwebarena
Documentation, WebArena 2.0, Evaluation Cache
Documentation
Update README files for compatibility with both WebArena (WA) and VisualWebArena (VWA).
WebArena 2.0
WebArena 2.0 addresses annotation issues reported by various users. Specifically:
- WebArena 2.0 minimizes the use of `exact_match` and `must_include` for information-seeking tasks with `StringEvaluator`. The migration from old evaluators to new ones generally follows these rules:
  - `exact_match` -> `fuzzy_exact_match`
  - `must_include`, `fuzzy_match`:
    - If the list contains 1 item -> `fuzzy_exact_match`
    - If the list contains > 1 item:
      - For elements on the same level and the same topic -> `fuzzy_must_include`
      - For elements covering different aspects -> `context_qa`
  - `na` -> `fuzzy_na_match`, which explicitly evaluates the reasoning behind unachievable outcomes.
  - Reddit post-related tasks -> `qa`.
- `context_qa` evaluates content based on both intent and answer.
- `qa` evaluates based only on the answer, as the intent is not relevant.
The prompts are tested in `evaluation_harness/eval_evaluators`.
- Other fixes
**Fixes from GitHub issues**
https://github.com/web-arena-x/webarena/issues/100
2: product type is too vague; removed
3: update the intent to indicate tied rank
4: update the intent to indicate tied rank
5: type is too vague; add the scope
https://github.com/web-arena-x/webarena/issues/135
45: update the intent to be more accurate
https://github.com/web-arena-x/webarena/issues/137
425: update the intent to be more accurate
**Individual fixes**
Template 324: remove the ranking requirement.
Template 204: use a combination of `context_qa` and `must_include`.
Templates 792 and 793 were deleted because the reasoning behind them was not sound.
Fix errors found by THU group [THU-Webarena-lite Bug Fixing](https://docs.google.com/spreadsheets/d/13BRuRlU_Z_UBcucjQ5myvrRdB0P0ID3Nj-dWlzawuYo/edit#gid=1021875443)
**Typo, grammar**
by far -> so far
https://github.com/web-arena-x/webarena/issues/133
correpong -> corresponding
telll -> tell
canlled -> cancelled
what could -> how could
competative -> competitive
Evaluation Cache
Support a result cache so that evaluation can be run offline. This will be helpful if we accept submissions in the future: participants only need to upload their cached files, and we can perform evaluation quickly without rerunning their models.
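A minimal sketch of such a cache, assuming one JSON file per task keyed by task ID. The function names (`cache_result`, `load_cached_answer`) and the file layout are illustrative assumptions, not the repository's actual cache format.

```python
import json
from pathlib import Path


def cache_result(cache_dir: Path, task_id: str, answer: str) -> Path:
    """Persist a model's final answer for one task (hypothetical format:
    one JSON file per task), so evaluation can be rerun offline later."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{task_id}.json"
    path.write_text(json.dumps({"task_id": task_id, "answer": answer}))
    return path


def load_cached_answer(cache_dir: Path, task_id: str) -> str:
    """Read a cached answer back for offline scoring, without the model."""
    return json.loads((cache_dir / f"{task_id}.json").read_text())["answer"]
```

The evaluator then scores `load_cached_answer(...)` against the task's reference instead of invoking the participant's model.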