reverse_vlm
🔥 [NeurIPS 2025] Official implementation of "Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling (REVERSE)"
Welcome to the official repository for our paper: Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling. Explore our project page here for an interactive overview!
Authors: Tsung-Han Wu, Heekyung Lee, Jiaxin Ge, Joseph E. Gonzalez, Trevor Darrell, David M. Chan (UC Berkeley & POSTECH)
Model Checkpoints:
- 🤗 tsunghanwu/reverse_llava_v15
- 🤗 tsunghanwu/reverse_llava_more
- 🤗 tsunghanwu/reverse_qwen25_vl
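For scripted downloads, the checkpoints can also be pulled with the huggingface_hub client. This is only a convenience sketch (the local directory is an arbitrary choice), not a step required by the repo:

```python
# Convenience sketch: fetch a REVERSE checkpoint from the Hugging Face Hub.
# The repo IDs come from the list above; the local directory is illustrative.
from huggingface_hub import snapshot_download

path = snapshot_download(
    repo_id="tsunghanwu/reverse_llava_v15",     # or reverse_llava_more / reverse_qwen25_vl
    local_dir="checkpoints/reverse_llava_v15",
)
print(f"Checkpoint files downloaded to: {path}")
```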
Dataset:
- 🤗 tsunghanwu/reverse-instruct-1.3m
Change Log:
- [04/17/2025]: REVERSE is now live on HuggingFace and GitHub! Explore the checkpoints, dataset, and full paper from our project site.
- [05/29/2025]: REVERSE now supports Qwen2.5-VL and remains effective in that setting. Check it out!
:wrench: Installation Guide
- Clone this repository
  ```bash
  git clone https://github.com/tsunghan-wu/reverse_vlm
  cd reverse_vlm
  ```
- Set up the environment
  - For LLaVA:
    ```bash
    conda create -n reverse python=3.10 -y
    conda activate reverse
    pip install --upgrade pip  # enable PEP 660 support
    pip install -e .
    pip install -e ".[train]"
    pip install flash-attn --no-build-isolation --no-cache-dir
    ```
  - For the Qwen series, please follow the installation guide in Qwen2-VL-Finetune.
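A quick, optional sanity check (not part of the repo) to confirm that PyTorch sees a GPU and flash-attn built successfully:

```python
# Optional environment check; assumes the LLaVA-style setup above was used.
import torch
import flash_attn  # raises ImportError if the flash-attn build failed

print(f"torch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")
print(f"flash-attn {flash_attn.__version__}")
```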
Evaluation
- Download model checkpoints:
  - 🤗 reverse_llava_v15 (LLaVA-v1.5-7B style model)
  - 🤗 reverse_llava_more (LLaVA with Llama-3.1-8B-Instruct style model)
  - 🤗 tsunghanwu/reverse_qwen25_vl (Qwen2.5-VL-3B-Instruct style model)
- Download the required evaluation files from Google Drive, then unzip and place them into playground/data/eval. Follow the included instructions to download additional assets.
- Run evaluations with:
  ```bash
  bash scripts/eval/*.sh
  ```
We conduct a 100-round bootstrapped evaluation; the reported numbers should closely match those in the paper.
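For intuition, the sketch below shows one way a 100-round bootstrapped score can be computed from per-example results. It is illustrative only (the repo's evaluation scripts are the source of truth), and the placeholder scores are made up:

```python
# Illustrative sketch of a 100-round bootstrapped evaluation (not the repo's code).
import numpy as np

def bootstrap_metric(scores: np.ndarray, rounds: int = 100, seed: int = 0):
    """Resample per-example scores with replacement and average each round."""
    rng = np.random.default_rng(seed)
    n = len(scores)
    means = np.array([scores[rng.integers(0, n, size=n)].mean() for _ in range(rounds)])
    return means.mean(), means.std()

# Placeholder per-example correctness values; real runs would load these from eval outputs.
scores = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 1.0])
mean, std = bootstrap_metric(scores)
print(f"bootstrapped score: {mean:.3f} +/- {std:.3f}")
```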
Training
1. Data Preparation
- Download QA pairs from: 🤗 tsunghanwu/reverse-instruct-1.3m
- Organize datasets under playground/data/ using the following structure (following LLaVA's layout):
```
playground/data/
├── coco
│   ├── annotations
│   ├── test2017
│   ├── train2017
│   └── val2017
├── gqa
│   └── images
├── ocr_vqa
│   └── images
├── share_textvqa
│   └── images
├── textvqa
│   └── train_images
└── vg
    ├── VG_100K
    └── VG_100K_2
```
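As an optional convenience (not part of the repo), a short script like the following can verify the layout above before launching training:

```python
# Optional check that the expected dataset folders exist under playground/data/.
from pathlib import Path

ROOT = Path("playground/data")
EXPECTED = [
    "coco/annotations", "coco/test2017", "coco/train2017", "coco/val2017",
    "gqa/images", "ocr_vqa/images", "share_textvqa/images",
    "textvqa/train_images", "vg/VG_100K", "vg/VG_100K_2",
]

missing = [d for d in EXPECTED if not (ROOT / d).is_dir()]
print("All dataset folders found." if not missing else f"Missing folders: {missing}")
```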
2. Model Setup and Training
- Add special tokens to the base LLM (a conceptual sketch of what these scripts do appears after the training command below):
  ```bash
  python3 scripts/add_new_token_to_llava.py   # for LLaVA-series base LLMs
  python3 scripts/add_new_token_to_qwen.py    # for Qwen2.5-VL
  ```
Supported settings:
- LoRA finetuning for the LLaVA series:
  - lmsys/vicuna-7b-v1.5, with mm_projector weights from LLaVA-v1.5-7B's projector
  - meta-llama/Llama-3.1-8B-Instruct, with mm_projector weights from LLaVA-MORE-8B's projector
- Direct finetuning for the Qwen2.5-VL model:
  - Qwen/Qwen2.5-VL-3B-Instruct
  - To ensure an apples-to-apples comparison, we fine-tune the released Qwen2.5-VL-3B model with both the LLaVA-FT setup and our REVERSE recipe on the same 100K subset. Since Qwen2.5-VL's instruction-tuning data is not publicly available, this lets us directly compare our training/inference recipe against the baseline under consistent conditions.
- Launch training:
  ```bash
  bash scripts/train/*.sh
  ```
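For reference, the add-new-token scripts conceptually follow the standard Hugging Face pattern sketched below: extend the tokenizer with new special tokens and resize the LLM's embedding matrix to match. The token strings here are placeholders; the actual tokens REVERSE adds (and the exact saving logic) are defined in scripts/add_new_token_to_llava.py and scripts/add_new_token_to_qwen.py.

```python
# Conceptual sketch of adding special tokens to a base LLM (placeholder token strings;
# the real tokens and saving logic live in scripts/add_new_token_to_*.py).
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "lmsys/vicuna-7b-v1.5"                    # one of the supported base LLMs
NEW_TOKENS = ["<new_token_1>", "<new_token_2>"]  # placeholders, not REVERSE's actual tokens
SAVE_DIR = "checkpoints/vicuna-7b-v1.5-new-tokens"

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE)

tokenizer.add_special_tokens({"additional_special_tokens": NEW_TOKENS})
model.resize_token_embeddings(len(tokenizer))    # grow embeddings/LM head to the new vocab size

tokenizer.save_pretrained(SAVE_DIR)
model.save_pretrained(SAVE_DIR)
```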
3. Merge LoRA Weights (for LLaVA series only)
After training, merge the LoRA adapter weights into the base model:
```bash
CUDA_VISIBLE_DEVICES=5 python3 scripts/merge_lora_weights.py \
    --model-path <your lora path> \
    --model-base <the base llm path with new tokens> \
    --save-model-path <final model path>
```
⚠️ Notes:
- Set GPU_SETTINGS and MASTER_PORT appropriately when using DeepSpeed.
- Naming matters:
  - Your LoRA directory should contain llava_lora.
  - The final merged model path should contain llava.

  This is required due to how LLaVA loads models internally; otherwise, it may fail silently or load incorrectly.
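Conceptually, the merge step combines the LoRA adapter with the (token-extended) base model and saves a standalone checkpoint. The sketch below uses the peft library's standard merge path and is only illustrative; scripts/merge_lora_weights.py is the supported workflow, and the paths shown simply follow the naming rules above.

```python
# Illustrative sketch of merging LoRA weights with peft (use scripts/merge_lora_weights.py
# for the real workflow; paths below are examples that respect the naming rules above).
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "checkpoints/vicuna-7b-v1.5-new-tokens"   # base LLM with the new tokens already added
LORA = "checkpoints/reverse_llava_lora"          # LoRA directory name contains "llava_lora"
OUT  = "checkpoints/reverse_llava_merged"        # merged model path contains "llava"

base_model = AutoModelForCausalLM.from_pretrained(BASE)
merged = PeftModel.from_pretrained(base_model, LORA).merge_and_unload()

merged.save_pretrained(OUT)
AutoTokenizer.from_pretrained(BASE).save_pretrained(OUT)
```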
Acknowledgements
We are grateful for the foundational code provided by LLaVA, LLaVA-More, and Fine-tuning Qwen2-VL Series. Utilizing their resources implies agreement with their respective licenses. Our project benefits greatly from these contributions, and we acknowledge their significant impact on our work.
Citation
If you use our work or the implementation in this repo, or find them helpful, please consider citing our paper.
```bibtex
@article{wu2025reverse,
  title={Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling},
  author={Wu, Tsung-Han and Lee, Heekyung and Ge, Jiaxin and Gonzalez, Joseph E and Darrell, Trevor and Chan, David M},
  journal={arXiv preprint arXiv:2504.13169},
  year={2025}
}
```