Mobile Device Evaluations -- AndroidControl, GuiOdyssey, et al.
Hello!
Awesome X announcement that you folks put out for Simular and kudos on the Agent S & Agent S2 papers.
I was curious about the performance of the Agent S2 system on some notable, static offline datasets for Android device manipulation. While AndroidWorld provides one useful signal for a system's capability to operate mobile phones, there is a wide variety of tasks that are not captured in its distribution, which I believe are present among other open source device manipulation datasets, such as:
Would it be possible to evaluate the Agent S2 system on these data sources? This question is complementary to this issue related to the evaluation setup for AndroidWorld.
following
🥇
I think that the community would also be curious about the artifacts for the latest Agent S3 model. Would it be possible to make some of the inference results available, such as trajectories of the model's interaction with the environments (screenshots with the annotated actions from each step)?