OpenAdapt Evaluate on OSWorld

Evaluate on OSWorld

Open abrichr opened this issue 9 months ago • 1 comment

Feature request

We would like to test OpenAdapt's ability to perform the tasks in https://os-world.github.io/.

This may involve creating recordings of the tasks described in the benchmark, since (as per https://github.com/xlang-ai/OSWorld/tree/main/evaluation_examples) the data samples are formatted as:

{
    "id": "uid", # unique id
    "snapshot": "snapshot_id", # the snapshot id of the environment, with some data already there and apps already opened, or just desktop
    "instruction": "natural_language_instruction", # the natural language instruction of the task, what we want the agent to do
    "source": "website_url", # where this example came from: some forum, some website, or some paper
    "config": {xxx}, # the scripts to set up the download and open-file actions, as the initial state of a task
    "trajectory": "trajectory_directory", # the trajectory directory, which contains the action sequence file, the screenshots and the recording video
    "related_apps": ["app1", "app2", ...], # the related apps, which are opened during the task
    "evaluator": "evaluation_dir", # the directory of the evaluator, which contains the evaluation script for this example
…
}

The ./trajectories file contains the annotated trajectories for each data item in ./examples for finishing the task.

Unfortunately, the ./trajectories file does not appear to be included in the repo. Completing this evaluation may therefore involve manually re-creating the trajectories via openadapt.record.
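
To scope how much manual recording this would take, a rough sketch like the one below could list the tasks that have no trajectory on disk. It assumes each example is stored as a single valid JSON file (without the `#` comments shown above) and that the `trajectory` field is a path relative to some trajectories root. Both are assumptions about the layout and are not confirmed by the OSWorld repo. The function names `load_examples` and `tasks_needing_recording` are hypothetical helpers, not part of OSWorld or OpenAdapt.

```python
import json
from pathlib import Path


def load_examples(examples_dir: Path) -> list[dict]:
    """Load every *.json task file found under examples_dir (assumed layout)."""
    examples = []
    for path in sorted(examples_dir.rglob("*.json")):
        with path.open(encoding="utf-8") as f:
            examples.append(json.load(f))
    return examples


def tasks_needing_recording(
    examples: list[dict], trajectories_root: Path
) -> list[tuple[str, str]]:
    """Return (id, instruction) pairs for tasks whose trajectory directory is absent.

    Assumption: the "trajectory" value is a path relative to trajectories_root.
    A missing or empty "trajectory" value also counts as absent.
    """
    pending = []
    for example in examples:
        trajectory = example.get("trajectory")
        trajectory_dir = trajectories_root / trajectory if trajectory else None
        if trajectory_dir is None or not trajectory_dir.is_dir():
            pending.append((example["id"], example["instruction"]))
    return pending
```

Calling `tasks_needing_recording(load_examples(Path("evaluation_examples/examples")), Path("evaluation_examples"))` from a local checkout would give the list of instructions to reproduce by hand, each of which could then be captured with openadapt.record. The paths in that call are illustrative only.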

Motivation

Evaluation

Apr 29 '24 00:04 abrichr

Related: https://github.com/xlang-ai/OSWorld/issues/30

Jun 13 '24 13:06 abrichr
