VQASynth
Compose multimodal datasets 🎹
Enhance the reasoning of multimodal models with pipelines to synthesize VQA datasets.
Background
Inspired by SpatialVLM, this repo uses ZoeDepth to adapt Vision Language Models for spatial reasoning. The demos feature two pipelines, both using LLaVA for object captioning and SAM for segmentation: one uses CLIPSeg for region proposal, while the other uses GroundingDINO.
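The key trick is to lift 2D segmentation masks into metric 3D space using the estimated depth, then phrase distances and left/right or taller/shorter comparisons between objects as question-answer pairs. Below is a minimal sketch of that back-projection step; the helper name, camera intrinsics, and inputs are illustrative assumptions rather than the repo's actual API. In the pipeline, captions from LLaVA supply the object descriptions that make the resulting questions and answers read naturally.

```python
import numpy as np

def mask_to_centroid(depth_m, mask, fx, fy, cx, cy):
    """Back-project a segmentation mask into metric 3D camera coordinates
    and return the object's centroid (illustrative helper, not the repo's API)."""
    v, u = np.nonzero(mask)          # pixel coordinates inside the mask
    z = depth_m[v, u]                # metric depth at those pixels
    x = (u - cx) * z / fx            # pinhole camera back-projection
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1).mean(axis=0)

# Hypothetical inputs: a metric depth map (e.g. from ZoeDepth) and two object
# masks (e.g. from SAM prompted by CLIPSeg/GroundingDINO region proposals).
depth_m = np.full((480, 640), 2.0, dtype=np.float32)
mask_a = np.zeros((480, 640), dtype=bool); mask_a[200:300, 100:200] = True
mask_b = np.zeros((480, 640), dtype=bool); mask_b[200:300, 400:500] = True

fx = fy = 525.0                      # assumed camera intrinsics
cx, cy = 320.0, 240.0

ca = mask_to_centroid(depth_m, mask_a, fx, fy, cx, cy)
cb = mask_to_centroid(depth_m, mask_b, fx, fy, cx, cy)

print(f"distance between objects: {np.linalg.norm(ca - cb) * 100:.1f} cm")
print("A is left of B" if ca[0] < cb[0] else "A is right of B")
```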
Environment
Before running the demo scripts, ensure you have the following installed:
- Python 3.9 or later
- Docker, Docker Compose V2
- NVIDIA Container Toolkit
CLIPSeg-based SpatialVLM data processing (recommended):
```bash
cd tests/data_processing/
docker build -f clipseg_data_processing.dockerfile -t vqasynth:clipseg-dataproc-test .
docker run --gpus all -v /path/to/output/:/path/to/output vqasynth:clipseg-dataproc-test --input_image="warehouse_rgb.jpg" --output_dir "/path/to/output"
```
GroundingDINO-based SpatialVLM data processing:
```bash
cd tests/data_processing/
docker build -f groundingDino_data_processing.dockerfile -t vqasynth:dino-dataproc-test .
docker run --gpus all -v /path/to/output/:/path/to/output vqasynth:dino-dataproc-test --input_image="warehouse_rgb.jpg" --output_dir "/path/to/output"
```
The scripts will produce 3D point clouds, segmented images, labels, and prompt examples for a test image.
Run a Pipeline on Your Images
The main pipeline uses Docker Compose to process a directory of images into a VQA dataset including spatial relations between objects. The dataset follows conventions for training models like LLaVA. We recommend using an A10 GPU or larger for processing.
Make sure to update `.env` with the full path to your image directory and output directory.
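The variable names below are illustrative assumptions; keep the keys your checked-out `.env` already defines and only update the paths:

```bash
# Hypothetical .env contents -- substitute your real host paths
IMAGE_DIR=/absolute/path/to/your/images
OUTPUT_DIR=/absolute/path/to/output
```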
Then launch the pipeline with:

```bash
cd /path/to/VQASynth
docker compose -f pipelines/spatialvqa.yaml up --build
```
In your designated output directory, you'll find a JSON file, `processed_dataset.json`, containing the formatted dataset.
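Each record pairs an image with question-answer turns in the LLaVA conversation style; the example below is illustrative (the values are drawn from the samples shown next, and the exact schema may differ slightly):

```json
{
  "id": "000001",
  "image": "warehouse_rgb.jpg",
  "conversations": [
    {
      "from": "human",
      "value": "<image>\nHow close is the man in red hat walking from the wooden pallet with boxes?"
    },
    {
      "from": "gpt",
      "value": "The man in red hat walking is 60.13 centimeters from the wooden pallet with boxes."
    }
  ]
}
```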
Here are some examples:
| *(warehouse image)* | *(warehouse image)* | *(warehouse image)* |
|---|---|---|
| Does the red forklift in warehouse appear on the left side of the brown cardboard boxes stacked? | How close is the man in red hat walking from the wooden pallet with boxes? | Does the man in blue shirt working have a greater height compared to the wooden pallet with boxes on floor? |
| Incorrect, the red forklift in warehouse is not on the left side of the brown cardboard boxes stacked. | The man in red hat walking is 60.13 centimeters from the wooden pallet with boxes. | Indeed, the man in blue shirt working is taller compared to the wooden pallet with boxes on floor. |
Here's a sample of warehouse images captioned with spatial relationships similar to the table above.
```bash
wget https://remyx.ai/assets/vqasynth/vqasynth_warehouse_spaces.zip

# Data is formatted for LLaVA fine-tuning
unzip vqasynth_warehouse_spaces.zip
```
Once the download completes, you can follow this resource on fine-tuning LLaVA.
Models
Check out our LLaVA-1.5-based LoRA, SpaceLLaVA, and the MobileVLM-based SpaceLLaVA-lite.
Try SpaceLLaVA in Discord
Notebooks
We've hosted some notebooks for visualizing and experimenting with the techniques included in this repo.
| Notebook | Description | Launch |
|---|---|---|
| Spatial Reasoning with Point Clouds | Visualize point clouds and evaluate spatial relationships | |
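If you'd rather poke at the generated point clouds outside the notebook, a short Open3D sketch like the one below works; the file path is a placeholder, and Open3D is assumed to be installed separately (`pip install open3d`):

```python
import open3d as o3d

# Load a point cloud produced by the data processing step (placeholder path)
pcd = o3d.io.read_point_cloud("/path/to/output/pointcloud.pcd")
print(pcd)  # prints a summary: point count, whether colors/normals are present

# Open an interactive viewer to inspect object geometry and relative positions
o3d.visualization.draw_geometries([pcd])
```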
References
This project was inspired by or utilizes concepts discussed in the following research paper(s):
```bibtex
@article{chen2024spatialvlm,
  title   = {SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities},
  author  = {Chen, Boyuan and Xu, Zhuo and Kirmani, Sean and Ichter, Brian and Driess, Danny and Florence, Pete and Sadigh, Dorsa and Guibas, Leonidas and Xia, Fei},
  journal = {arXiv preprint arXiv:2401.12168},
  year    = {2024},
  url     = {https://arxiv.org/abs/2401.12168},
}
```