DivScene: Benchmarking LVLMs for Object Navigation with Diverse Scenes and Objects
This repository is the official implementation of the paper DivScene: Benchmarking LVLMs for Object Navigation with Diverse Scenes and Objects.
An illustration of the data curation.
Requirements
Embodied Environment
DivScene is built on the AI2-THOR simulator on macOS with Python 3.9.16.
You can find the requirements in the GitHub repository of Holodeck
and follow their instructions to set up the environment.
Specifically, the commit we used is 156f8e10 at the linked repository.
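As a quick sanity check of the simulator setup, you can launch AI2-THOR and execute a single action. This is a minimal sketch assuming a standard ai2thor installation; the Holodeck instructions determine the exact simulator build you should use.

```python
# Minimal AI2-THOR smoke test on macOS (assumes a working `ai2thor` install).
from ai2thor.controller import Controller

controller = Controller()                      # launches a default scene
event = controller.step(action="RotateRight")  # take one action
print("AI2-THOR is working:", event.metadata["lastActionSuccess"])
controller.stop()
```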
Embodied Agent
We train and test our agent NatVLM with the Megatron-LM framework on Linux (CentOS).
The requirements are shown in requirement.txt.
DivScene Data
In our work, we build a new scene dataset, DivScene, which contains 4,614 houses covering 81 distinct scene types.
The data file DivScene.zip is released at DivScene-DivTraj on the HuggingFace Hub.
You can use the unzip or 7z command to extract the house JSONs from the zip file. The training/validation/test split is given in split_file.json.
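For orientation, here is a minimal sketch of reading the split file and loading one house JSON after extraction. The directory layout and key names below are assumptions; adjust them to match the extracted archive.

```python
import json
from pathlib import Path

# Assumed layout after extracting DivScene.zip; adjust to the actual archive structure.
data_root = Path("DivScene")
splits = json.loads((data_root / "split_file.json").read_text())

test_house_ids = splits["test"]  # assumes "train"/"val"/"test" keys in split_file.json
house_path = data_root / "houses" / f"{test_house_ids[0]}.json"  # hypothetical path
house = json.loads(house_path.read_text())  # a Holodeck-style house specification
print(f"Loaded house with {len(house.get('objects', []))} objects")
```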
Note: our houses are built with Holodeck, so you need to configure the Objaverse assets correctly.
Build new houses:
- Gather the textual house descriptions with the code in sample_data/gather_gpt4_prompt.
- Input those descriptions into Holodeck.
- Use sample_data/regenerate_init_position.py to search for a valid initial position for the embodied agent (see the sketch after this list).
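sample_data/regenerate_init_position.py is the authoritative implementation. The sketch below only illustrates the general idea with standard AI2-THOR actions (load a procedural house, query reachable positions, teleport the agent); the house path and sampling logic are assumptions.

```python
import json
import random
from ai2thor.controller import Controller

# Hypothetical path; any Holodeck/ProcTHOR-style house dict works here.
house = json.load(open("DivScene/houses/example_house.json"))

# Assumes an AI2-THOR build that accepts procedural house dicts (as Holodeck/ProcTHOR do).
controller = Controller(scene=house)

# Query positions the agent can stand on and pick one as a candidate initial position.
event = controller.step(action="GetReachablePositions")
reachable = event.metadata["actionReturn"]
init_position = random.choice(reachable)

# Teleport there and verify the placement succeeded.
event = controller.step(action="Teleport", position=init_position)
assert event.metadata["lastActionSuccess"], "chosen initial position is not valid"
controller.stop()
```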
DivTraj Data
Similarly, the shortest-path episodes we sampled are available at DivScene-DivTraj on the HuggingFace Hub. There are 5 episodes per house in the training set and 4 episodes per house in the validation and test sets.
Format of episode name: {house_id}-{traj_id}
New episode sampling: Use sample_data/generate_trajectories.py to generate more trajectories.
Instruction File: We uploaded it to the DivScene-DivTraj HuggingFace dataset. Here is the link.
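Because episode names follow the {house_id}-{traj_id} pattern, grouping episodes by house is straightforward; the sketch below assumes the trajectory ID is the part after the last hyphen, and the example names are made up.

```python
from collections import defaultdict

episode_names = ["000123-0", "000123-1", "000456-2"]  # made-up examples of {house_id}-{traj_id}

episodes_by_house = defaultdict(list)
for name in episode_names:
    house_id, traj_id = name.rsplit("-", 1)  # split on the last hyphen
    episodes_by_house[house_id].append(traj_id)

print(dict(episodes_by_house))  # {'000123': ['0', '1'], '000456': ['2']}
```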
Training Models
1. Prepare Data: Our training code is revised from the Megatron-LM framework and Pai-Megatron.
We provide a Large Vision-Language Model (LVLM) with the instruction for a step and ask it to generate the next step. Here, we follow the instruction data format of
LLaVA. We use convert_to_llava_format_with_pos_cot.py to convert DivTraj trajectories into the LLaVA format and
also list useful commands in convert_to_llava_format.sh. The instruction file is available here.
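convert_to_llava_format_with_pos_cot.py is the authoritative converter. For orientation only, the sketch below shows the general shape of a LLaVA-style instruction sample (an image reference plus alternating human/gpt turns); the field contents are illustrative and not the exact prompts used in this repository.

```python
import json

# One LLaVA-style instruction sample (illustrative content, not the repo's exact prompts).
sample = {
    "id": "000123-0_step3",            # hypothetical episode/step identifier
    "image": "000123-0/step_3.jpg",    # hypothetical path to the egocentric frame
    "conversations": [
        {"from": "human",
         "value": "<image>\nYou are navigating to a sofa. What is the next action?"},
        {"from": "gpt",
         "value": "The sofa is ahead on the left, so the next action is MoveAhead."},
    ],
}

with open("divtraj_llava.json", "w") as f:
    json.dump([sample], f, indent=2)
```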
2. Train Model:
- First, use webdataset to compress the data. The script is agent_training/toolkits/pretrain_data_preprocessing/move_bulk_data.py. webdataset can speed up data loading when training the model (a short sketch follows this list).
- The training script is agent_training/examples/idefics2/train_llava_instruct_webdataset_cot.sh. We also leave some commands in agent_training/examples/idefics2/run_cot_cmd.sh.
- Please use the code in model_checkpoints_convertor to convert the model between the HuggingFace format and the Megatron format.
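As a short sketch of the webdataset step: samples are packed into tar shards once and then streamed sequentially at training time, which avoids the overhead of many small files. The key names and shard pattern below are assumptions; move_bulk_data.py is the script actually used.

```python
import webdataset as wds

# Write one sample into a tar shard (keys and shard names here are illustrative).
with wds.TarWriter("shards/train-000000.tar") as sink:
    sink.write({
        "__key__": "000123-0_step3",                       # unique sample key
        "jpg": open("000123-0/step_3.jpg", "rb").read(),   # raw image bytes
        "json": {"instruction": "...", "answer": "..."},   # serialized as JSON by TarWriter
    })

# Stream the shards back at training time; reading tar files sequentially is faster
# than opening many small image/text files.
dataset = (
    wds.WebDataset("shards/train-{000000..000009}.tar")
    .decode("pil")
    .to_tuple("jpg", "json")
)
image, meta = next(iter(dataset))
```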
LICENSE NOTICE: We release our revisions of Pai-Megatron and Megatron-LM. If you use this code, it remains subject to the licenses of the original releases.
Inference
We conduct inference in a model-serving mode: we deploy the trained LVLM on Linux servers, then run AI2-THOR on macOS and call the LVLM's API to complete navigation.
- See the commands in agent_inference/run_server.sh to deploy the model with FastAPI (a hedged sketch of the serving pattern follows this list).
- Run the commands in agent_inference/run_client.sh on macOS with AI2-THOR to test your model.
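agent_inference/run_server.sh and agent_inference/run_client.sh are the actual entry points. The sketch below only illustrates the serving pattern; the endpoint name, request fields, and port are assumptions.

```python
# server.py -- sketch of serving the trained LVLM with FastAPI
# (the endpoint name, request fields, and port are assumptions).
import base64

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class StepRequest(BaseModel):
    instruction: str   # navigation instruction for the current step
    image_base64: str  # current egocentric observation, base64-encoded

@app.post("/predict")
def predict(req: StepRequest):
    image_bytes = base64.b64decode(req.image_base64)
    # action = lvlm.generate(image_bytes, req.instruction)  # placeholder for the real model call
    action = "MoveAhead"
    return {"action": action}

# Client side (run on macOS next to AI2-THOR), e.g. with `requests`:
#   payload = {"instruction": "Find the sofa.",
#              "image_base64": base64.b64encode(frame_bytes).decode()}
#   action = requests.post("http://<server>:8000/predict", json=payload).json()["action"]
```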
Citation
Please cite the repo if you use the data or code.
@inproceedings{wang2024divscenebenchmarkinglvlmsobject,
title={DivScene: Benchmarking LVLMs for Object Navigation with Diverse Scenes and Objects},
author={Zhaowei Wang and Hongming Zhang and Tianqing Fang and Ye Tian and Yue Yang and Kaixin Ma and Xiaoman Pan and Yangqiu Song and Dong Yu},
year={2024},
eprint={2410.02730},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2410.02730}
}
Contributing
This repo is maintained by Zhaowei Wang.