visualwebarena
Configuration Setup for All Model Types
Hello,
Could you please share the configuration settings needed to reproduce each of the model types?
I tried to reproduce the caption-augmented setup (Acc. Tree + Caps), but my result was closer to the Multimodal result, which also takes the image screenshot as input. I was hoping to get clarification on how to switch between the four modes.
Here are my configurations:
(1) Text-only: observation_type: accessibility_tree, action_set_tag: id_accessibility_tree
(2) Caption-augmented: observation_type: accessibility_tree_with_captioner, action_set_tag: id_accessibility_tree
(3) Multimodal: observation_type: ???, action_set_tag: id_accessibility_tree
(4) Multimodal (SoM): observation_type: image_som, action_set_tag: som
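For reference, here are the same four settings as a small Python sketch. It only restates the values listed above. The Multimodal observation_type is left as None because that is exactly what I am asking about. `MODE_CONFIGS` and `build_cli_args` are hypothetical names I made up for illustration. I am also assuming the run script takes these settings as `--observation_type` and `--action_set_tag` flags, so please correct me if the flag names differ.

```python
from typing import Dict, List, Optional

# Settings as listed in this post. The Multimodal observation_type is
# unknown (the "???" above), so it is stored as None.
MODE_CONFIGS: Dict[str, Dict[str, Optional[str]]] = {
    "text_only": {
        "observation_type": "accessibility_tree",
        "action_set_tag": "id_accessibility_tree",
    },
    "caption_augmented": {
        "observation_type": "accessibility_tree_with_captioner",
        "action_set_tag": "id_accessibility_tree",
    },
    "multimodal": {
        "observation_type": None,  # unknown: this is the open question
        "action_set_tag": "id_accessibility_tree",
    },
    "multimodal_som": {
        "observation_type": "image_som",
        "action_set_tag": "som",
    },
}


def build_cli_args(mode: str) -> List[str]:
    """Hypothetical helper: turn a mode into "--flag value" pairs.

    Assumes (unverified) that the run script accepts --observation_type
    and --action_set_tag. Raises ValueError for an unknown mode or for
    a setting that is still unresolved (None).
    """
    if mode not in MODE_CONFIGS:
        raise ValueError(f"unknown mode: {mode}")
    args: List[str] = []
    for key, value in MODE_CONFIGS[mode].items():
        if value is None:
            raise ValueError(f"{mode}: {key} is not known yet")
        args += [f"--{key}", value]
    return args


if __name__ == "__main__":
    print(build_cli_args("caption_augmented"))
```

Calling `build_cli_args("multimodal")` deliberately raises an error until the missing observation_type is filled in, so an unresolved setting cannot silently fall back to another mode.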