GenHancer: Imperfect Generative Models are Secretly Strong Vision-Centric Enhancers (ICCV 2025)
Shijie Ma¹·², Yuying Ge¹, Teng Wang¹, Yuxin Guo¹·², Yixiao Ge¹, Ying Shan¹
¹ ARC Lab, Tencent PCG  ² Institute of Automation, CAS
TL;DR
How do generative models effectively help discriminative models?
We present in-depth explorations and propose a novel two-stage post-training strategy to enhance CLIP ViT's visual representations.
Our method is applicable to both continuous and discrete denoisers and does not require any pre-trained weights.
News
- [2025-06-26] GenHancer is accepted to ICCV 2025!
- [2025-03-27] Training code for continuous denoisers is released!
- [2025-03-26] The arXiv paper is publicly available.
- [2025-03-24] Evaluation code is released.
- [2025-03-24] Model weights are released on Hugging Face 🤗.
- [2025-03-24] The project page of this repo is released.
TODOs
- [x] Release training code for continuous denoisers.
- [ ] Release training code for discrete denoisers.
Introduction
Recent works demonstrate the feasibility of enhancing visual representations with generative models, where a generative model takes visual tokens as conditions and performs reconstruction. However, the underlying principles remain underexplored.
We empirically reveal that perfect generation (reconstruction) does not always yield desirable visual representations, as shown below:

In this work, we delve into three aspects to explore the critical factors: (1) conditioning mechanisms, (2) denoising configurations and (3) generation paradigms.
We propose a two-stage post-training method to enhance CLIP ViT's fine-grained visual representations, which is efficient (with only lightweight denoisers) and versatile (applicable to both continuous and discrete denoisers). The pipeline of our method is illustrated below:

> [!IMPORTANT]
> We empirically find that, for visual representations, a visually perfect generative model is neither optimal nor necessary.
> Our method employs only lightweight generative models and does NOT require any pre-trained weights, which makes it efficient and avoids potential privacy and copyright issues.
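To make the two-stage pipeline above concrete, here is a minimal conceptual sketch in PyTorch. Everything in it is an illustrative assumption: the toy vision encoder standing in for CLIP ViT, the module sizes, the noise corruption, and the plain regression loss. It is not the repository's training code; see the Training section for the actual implementation.

```python
# Conceptual sketch of the two-stage post-training idea (illustrative only).
import torch
import torch.nn as nn

class ToyViT(nn.Module):
    """Placeholder standing in for a CLIP ViT; only the token interface matters here."""
    def __init__(self, dim=256):
        super().__init__()
        self.embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)   # 64x64 image -> 4x4 tokens
        self.blocks = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, images):                       # (B, 3, 64, 64) -> (B, 16, dim)
        tokens = self.embed(images).flatten(2).transpose(1, 2)
        return self.blocks(tokens)

class LightDenoiser(nn.Module):
    """Lightweight generator that reconstructs the image conditioned on visual tokens."""
    def __init__(self, dim=256, img_dim=3 * 64 * 64):
        super().__init__()
        self.noisy_proj = nn.Linear(img_dim, dim)
        self.cond_proj = nn.Linear(dim, dim)
        self.head = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, img_dim))

    def forward(self, noisy_images, visual_tokens):
        h = self.noisy_proj(noisy_images.flatten(1))
        cond = self.cond_proj(visual_tokens).mean(dim=1)   # pooled conditioning signal
        return self.head(h + cond)                          # predict the clean image

def recon_loss(denoiser, vit, images):
    noisy = images + 0.5 * torch.randn_like(images)          # corrupt the input (simplified)
    pred = denoiser(noisy, vit(images))                       # condition on the ViT's visual tokens
    return nn.functional.mse_loss(pred, images.flatten(1))

vit, denoiser = ToyViT(), LightDenoiser()
images = torch.randn(4, 3, 64, 64)                            # dummy batch

# Stage 1: the ViT stays frozen; only the lightweight denoiser learns to reconstruct.
for p in vit.parameters():
    p.requires_grad_(False)
opt1 = torch.optim.AdamW(denoiser.parameters(), lr=1e-4)
recon_loss(denoiser, vit, images).backward()
opt1.step(); opt1.zero_grad()

# Stage 2: the ViT is unfrozen, so the reconstruction signal refines its visual tokens.
# (Whether the denoiser also keeps training here is a simplification in this sketch.)
for p in vit.parameters():
    p.requires_grad_(True)
opt2 = torch.optim.AdamW(list(vit.parameters()) + list(denoiser.parameters()), lr=1e-5)
recon_loss(denoiser, vit, images).backward()
opt2.step(); opt2.zero_grad()
```

The point the sketch tries to convey is that the denoiser is deliberately lightweight and trained from scratch, so the reconstruction signal serves the vision encoder rather than a large pre-trained generator.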
Released Weights
We release the enhanced CLIP weights on Hugging Face 🤗.
| CLIP Backbone | MMVP-VLM (Original) | MMVP-VLM (Ours) | Link |
|---|---|---|---|
| OpenAI CLIP ViT-L-14@224 | 19.3 | 31.9 | 🤗 |
| OpenAI CLIP ViT-L-14@336 | 20.0 | 29.6 | 🤗 |
| MetaCLIP ViT-L-14@224 | 23.7 | 31.9 | 🤗 |
| MetaCLIP ViT-H-14@224 | 25.2 | 37.0 | 🤗 |
| SigLIP ViT-SO-14@224 | 37.8 | 42.2 | 🤗 |
| SigLIP ViT-SO-14@384 | 37.0 | 40.0 | 🤗 |
Training
Please refer to the corresponding directories for more details.
For the continuous denoiser, navigate to Continuous; for the discrete denoiser, navigate to Discrete.
Evaluation
Please first download the MMVP-VLM benchmark.
We provide evaluation scripts for six CLIP backbones. An example for OpenAI CLIP@224 is as follows:
python evaluation/evaluate_mmvp_OpenAICLIP_224.py --benchmark_dir 'YOUR_MMVP_VLM_PATH' --vision_tower_name 'YOUR_VISION_TOWER'
> [!NOTE]
> Please specify `--vision_tower_name` as your trained CLIP model, which is conventionally saved via `save_pretrained()`. If you want to evaluate an official CLIP model such as OpenAI CLIP@224, you can specify `--vision_tower_name` as the official `hf_repo_id`, e.g., `openai/clip-vit-large-patch14`.
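For reference, a vision tower saved with `save_pretrained()` and an official backbone referenced by its `hf_repo_id` are typically handled as in the sketch below. This is generic Hugging Face `transformers` usage with placeholder paths, not necessarily the exact loading logic inside the evaluation scripts.

```python
# Generic `transformers` usage; the local directory name is a placeholder.
from transformers import CLIPVisionModel

# Official backbone, referenced by its hf_repo_id:
vision_tower = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")

# A post-trained vision tower is conventionally saved and reloaded like this:
vision_tower.save_pretrained("./enhanced_clip_vit_l_224")   # hypothetical output directory
vision_tower = CLIPVisionModel.from_pretrained("./enhanced_clip_vit_l_224")
```

Either the local directory or the official `hf_repo_id` can then be passed to `--vision_tower_name`.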
Acknowledgements
When building the codebase for continuous denoisers, we referred to x-flux. Thanks for their wonderful project. Notably, we do NOT use their pre-trained weights.
License
This repository is released under the Apache 2.0 License.
BibTeX
@article{ma2025genhancer,
  title={GenHancer: Imperfect Generative Models are Secretly Strong Vision-Centric Enhancers},
  author={Ma, Shijie and Ge, Yuying and Wang, Teng and Guo, Yuxin and Ge, Yixiao and Shan, Ying},
  journal={arXiv preprint arXiv:2503.19480},
  year={2025}
}
Contact
If you have further questions, feel free to contact me: [email protected]
Discussions and potential collaborations are also welcome.