reverse_vlm
🔥 [NeurIPS 2025] Official implementation of "Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling (REVERSE)"
Welcome to the official repository for our paper: Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling. Explore our project page here for an interactive overview!
Authors: Tsung-Han Wu, Heekyung Lee, Jiaxin Ge, Joseph E. Gonzalez, Trevor Darrell, David M. Chan (UC Berkeley & POSTECH)
Model Checkpoints:
- 🤗 tsunghanwu/reverse_llava_v15
- 🤗 tsunghanwu/reverse_llava_more
- 🤗 tsunghanwu/reverse_qwen25_vl
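For scripted downloads, the checkpoints can also be pulled with the huggingface_hub client. This is only a convenience sketch (the local directory is an arbitrary choice), not a step required by the repo:

```python
# Convenience sketch: fetch a REVERSE checkpoint from the Hugging Face Hub.
# The repo IDs come from the list above; the local directory is illustrative.
from huggingface_hub import snapshot_download

path = snapshot_download(
    repo_id="tsunghanwu/reverse_llava_v15",     # or reverse_llava_more / reverse_qwen25_vl
    local_dir="checkpoints/reverse_llava_v15",
)
print(f"Checkpoint files downloaded to: {path}")
```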
Dataset:
- 🤗 tsunghanwu/reverse-instruct-1.3m
Change Log:
- [04/17/2025]: REVERSE is now live on HuggingFace and GitHub! Explore the checkpoints, dataset, and full paper from our project site.
- [05/29/2025]: REVERSE now supports Qwen2.5-VL and remains effective in that setting. Check it out!
:wrench: Installation Guide
- Clone this repository
  ```bash
  git clone https://github.com/tsunghan-wu/reverse_vlm
  cd reverse_vlm
  ```
- Set up the environment
  - For LLaVA:
    ```bash
    conda create -n reverse python=3.10 -y
    conda activate reverse
    pip install --upgrade pip  # enable PEP 660 support
    pip install -e .
    pip install -e ".[train]"
    pip install flash-attn --no-build-isolation --no-cache-dir
    ```
  - For the Qwen series, please follow the installation guide in Qwen2-VL-Finetune.
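A quick, optional sanity check (not part of the repo) to confirm that PyTorch sees a GPU and flash-attn built successfully:

```python
# Optional environment check; assumes the LLaVA-style setup above was used.
import torch
import flash_attn  # raises ImportError if the flash-attn build failed

print(f"torch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")
print(f"flash-attn {flash_attn.__version__}")
```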
Evaluation
- Download model checkpoints:
  - 🤗 reverse_llava_v15 (LLaVA-v1.5-7B style model)
  - 🤗 reverse_llava_more (LLaVA with Llama-3.1-8B-Instruct style model)
  - 🤗 tsunghanwu/reverse_qwen25_vl (Qwen2.5-VL-3B-Instruct style model)
- Download the required evaluation files from Google Drive, then unzip and place them into playground/data/eval. Follow the included instructions to download additional assets.
- Run evaluations with:
  ```bash
  bash scripts/eval/*.sh
  ```
We conduct a 100-round bootstrapped evaluation; the reported numbers should closely match those in the paper.
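For intuition, the sketch below shows one way a 100-round bootstrapped score can be computed from per-example results. It is illustrative only (the repo's evaluation scripts are the source of truth), and the placeholder scores are made up:

```python
# Illustrative sketch of a 100-round bootstrapped evaluation (not the repo's code).
import numpy as np

def bootstrap_metric(scores: np.ndarray, rounds: int = 100, seed: int = 0):
    """Resample per-example scores with replacement and average each round."""
    rng = np.random.default_rng(seed)
    n = len(scores)
    means = np.array([scores[rng.integers(0, n, size=n)].mean() for _ in range(rounds)])
    return means.mean(), means.std()

# Placeholder per-example correctness values; real runs would load these from eval outputs.
scores = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 1.0])
mean, std = bootstrap_metric(scores)
print(f"bootstrapped score: {mean:.3f} +/- {std:.3f}")
```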
Training
1. Data Preparation
- Download QA pairs from: 🤗 tsunghanwu/reverse-instruct-1.3m
- Organize datasets under playground/data/ using the following structure (following LLaVA's layout):
```
playground/data/
├── coco
│   ├── annotations
│   ├── test2017
│   ├── train2017
│   └── val2017
├── gqa
│   └── images
├── ocr_vqa
│   └── images
├── share_textvqa
│   └── images
├── textvqa
│   └── train_images
└── vg
    ├── VG_100K
    └── VG_100K_2
```
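As an optional convenience (not part of the repo), a short script like the following can verify the layout above before launching training:

```python
# Optional check that the expected dataset folders exist under playground/data/.
from pathlib import Path

ROOT = Path("playground/data")
EXPECTED = [
    "coco/annotations", "coco/test2017", "coco/train2017", "coco/val2017",
    "gqa/images", "ocr_vqa/images", "share_textvqa/images",
    "textvqa/train_images", "vg/VG_100K", "vg/VG_100K_2",
]

missing = [d for d in EXPECTED if not (ROOT / d).is_dir()]
print("All dataset folders found." if not missing else f"Missing folders: {missing}")
```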
2. Model Setup and Training
- Add special tokens to the base LLM (a conceptual sketch of what these scripts do appears after the training command below):
  ```bash
  python3 scripts/add_new_token_to_llava.py   # for LLaVA-series base LLMs
  python3 scripts/add_new_token_to_qwen.py    # for Qwen2.5-VL
  ```
Supported settings:
- LoRA finetuning for the LLaVA series:
  - lmsys/vicuna-7b-v1.5, with mm_projector weights from LLaVA-v1.5-7B's projector
  - meta-llama/Llama-3.1-8B-Instruct, with mm_projector weights from LLaVA-MORE-8B's projector
- Direct finetuning for the Qwen2.5-VL model:
  - Qwen/Qwen2.5-VL-3B-Instruct
  - To ensure an apples-to-apples comparison, we fine-tune the released Qwen2.5-VL-3B model with both the LLaVA-FT setup and our REVERSE recipe on the same 100K subset. Since Qwen2.5-VL's instruction-tuning data is not publicly available, this lets us directly compare our training/inference recipe against the baseline under consistent conditions.
- Launch training:
  ```bash
  bash scripts/train/*.sh
  ```
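For reference, the add-new-token scripts conceptually follow the standard Hugging Face pattern sketched below: extend the tokenizer with new special tokens and resize the LLM's embedding matrix to match. The token strings here are placeholders; the actual tokens REVERSE adds (and the exact saving logic) are defined in scripts/add_new_token_to_llava.py and scripts/add_new_token_to_qwen.py.

```python
# Conceptual sketch of adding special tokens to a base LLM (placeholder token strings;
# the real tokens and saving logic live in scripts/add_new_token_to_*.py).
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "lmsys/vicuna-7b-v1.5"                    # one of the supported base LLMs
NEW_TOKENS = ["<new_token_1>", "<new_token_2>"]  # placeholders, not REVERSE's actual tokens
SAVE_DIR = "checkpoints/vicuna-7b-v1.5-new-tokens"

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE)

tokenizer.add_special_tokens({"additional_special_tokens": NEW_TOKENS})
model.resize_token_embeddings(len(tokenizer))    # grow embeddings/LM head to the new vocab size

tokenizer.save_pretrained(SAVE_DIR)
model.save_pretrained(SAVE_DIR)
```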
3. Merge LoRA Weights (for LLaVA series only)
After training, merge the LoRA adapter weights into the base model:
```bash
CUDA_VISIBLE_DEVICES=5 python3 scripts/merge_lora_weights.py \
    --model-path <your lora path> \
    --model-base <the base llm path with new tokens> \
    --save-model-path <final model path>
```
⚠️ Notes:
- Set GPU_SETTINGS and MASTER_PORT appropriately when using DeepSpeed.
- Naming matters:
  - Your LoRA directory should contain llava_lora.
  - The final merged model path should contain llava.

  This is required due to how LLaVA loads models internally; otherwise, it may fail silently or load incorrectly.
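Conceptually, the merge step combines the LoRA adapter with the (token-extended) base model and saves a standalone checkpoint. The sketch below uses the peft library's standard merge path and is only illustrative; scripts/merge_lora_weights.py is the supported workflow, and the paths shown simply follow the naming rules above.

```python
# Illustrative sketch of merging LoRA weights with peft (use scripts/merge_lora_weights.py
# for the real workflow; paths below are examples that respect the naming rules above).
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "checkpoints/vicuna-7b-v1.5-new-tokens"   # base LLM with the new tokens already added
LORA = "checkpoints/reverse_llava_lora"          # LoRA directory name contains "llava_lora"
OUT  = "checkpoints/reverse_llava_merged"        # merged model path contains "llava"

base_model = AutoModelForCausalLM.from_pretrained(BASE)
merged = PeftModel.from_pretrained(base_model, LORA).merge_and_unload()

merged.save_pretrained(OUT)
AutoTokenizer.from_pretrained(BASE).save_pretrained(OUT)
```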
Acknowledgements
We are grateful for the foundational code provided by LLaVA, LLaVA-More, and Fine-tuning Qwen2-VL Series. Utilizing their resources implies agreement with their respective licenses. Our project benefits greatly from these contributions, and we acknowledge their significant impact on our work.
Citation
If you use our work or the implementation in this repo, or find them helpful, please consider citing our paper.
```bibtex
@article{wu2025reverse,
  title={Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling},
  author={Wu, Tsung-Han and Lee, Heekyung and Ge, Jiaxin and Gonzalez, Joseph E and Darrell, Trevor and Chan, David M},
  journal={arXiv preprint arXiv:2504.13169},
  year={2025}
}
```