Awesome-Open-Vocabulary-Detection-and-Segmentation
Awesome OVD-OVS - A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future
A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future
:sparkles: PRs are welcome!
Please stay tuned: this repo is maintained on a weekly basis.
Todo
- [ ] Add implementation details for each method, such as template prompts vs. learnable prompts, CLIP text encoder vs. BERT, initialization of the image encoder, etc.
General Overview
In this survey, we cover two settings (zero-shot and open-vocabulary) and six tasks (object detection, semantic/instance/panoptic segmentation, 3D scene understanding, and video understanding). We build the taxonomy on two questions: whether weak supervision signals are permitted, and how they are used. The resulting taxonomy applies across all of these settings and tasks. The weak supervision signals can be image-text pairs or large vision-language models. Below is a general overview of each methodology.
In the current literature, zero-shot and open-vocabulary are used interchangeably. We nevertheless note their subtle differences by tracing the evolution from the traditional zero-shot setting to the newly formulated open-vocabulary setting.
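As a rough illustration of what an open vocabulary means in practice, the sketch below labels a region by cosine similarity between its feature and text embeddings of class names, so the label set can be swapped at inference time without retraining. It is a minimal sketch under stated assumptions and does not reproduce any method listed here: `embed_text` is a hypothetical stand-in that returns deterministic random unit vectors in place of a real text encoder, and the region feature is synthetic.

```python
import hashlib
from typing import List

import numpy as np

DIM = 64  # embedding size, chosen only for illustration


def embed_text(name: str) -> np.ndarray:
    """Hypothetical stand-in for a text encoder: one deterministic unit vector per string."""
    seed = int.from_bytes(hashlib.sha256(name.encode("utf-8")).digest()[:4], "little")
    vec = np.random.default_rng(seed).standard_normal(DIM)
    return vec / np.linalg.norm(vec)


def classify_region(region_feat: np.ndarray, vocabulary: List[str]) -> str:
    """Return the vocabulary entry whose text embedding is most similar to the region."""
    text_embs = np.stack([embed_text(name) for name in vocabulary])  # shape (C, DIM)
    region_feat = region_feat / np.linalg.norm(region_feat)
    scores = text_embs @ region_feat  # cosine similarity: all vectors are unit-norm
    return vocabulary[int(np.argmax(scores))]


# Synthetic region feature: the "zebra" text embedding plus a little noise.
noise = 0.05 * np.random.default_rng(0).standard_normal(DIM)
region = embed_text("zebra") + noise

base_vocab = ["person", "car", "dog"]        # categories assumed seen during training
open_vocab = base_vocab + ["zebra", "kite"]  # vocabulary extended at inference time

label_closed = classify_region(region, base_vocab)
label_open = classify_region(region, open_vocab)
```

With the fixed vocabulary the region is forced onto one of the base categories, whereas the extended vocabulary lets the same classifier pick a name it was never trained on.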
Table of Contents
- Zero-Shot Object Detection
  - Visual-Semantic Space Mapping
  - Novel Visual Feature Synthesis
- Zero-Shot Segmentation
  - Zero-Shot Semantic Segmentation
    - Visual-Semantic Space Mapping
    - Novel Visual Feature Synthesis
  - Zero-Shot Instance Segmentation
- Open-Vocabulary Object Detection
  - Region-Aware Training
  - Pseudo-Labeling
  - Knowledge Distillation
  - Transfer Learning
- Open-Vocabulary Segmentation
  - Open-Vocabulary Semantic Segmentation
    - Region-Aware Training
    - Pseudo-Labeling
    - Knowledge Distillation
    - Transfer Learning
  - Open-Vocabulary Instance Segmentation
    - Region-Aware Training
    - Pseudo-Labeling
    - Knowledge Distillation
  - Open-Vocabulary Panoptic Segmentation
    - Region-Aware Training
    - Knowledge Distillation
    - Transfer Learning
- Open-Vocabulary 3D Scene Understanding
  - Open-Vocabulary 3D Detection
  - Open-Vocabulary 3D Segmentation
    - Open-Vocabulary 3D Semantic Segmentation
    - Open-Vocabulary 3D Instance Segmentation
- Open-Vocabulary Video Understanding
  - Open-Vocabulary Video Instance Segmentation
- Acknowledgement
Zero-Shot Object Detection
Visual-Semantic Space Mapping
Venue | Paper Abbr | Project |
---|---|---|
ECCV'18 | ZSDv1 | N/A |
ACCV'18 & IJCV'20 | ZSDv2 | N/A |
AAAI'20 | CA-ZSR | Code |
AAAI'19 | ZSD-TD | N/A |
ACCV'20 | BLC | Code |
ICCV'19 | TL-ZSD | N/A |
arXiv'23 | SSB | N/A |
WACV'20 | MS-Zero | N/A |
TCSVT'19 | ZS-YOLO | N/A |
AAAI'21 | DPIF | Code |
TPAMI'21 | ContrastZSD | N/A |
IJCAI'20 | ZSD-CNN | N/A |
Novel Visual Feature Synthesis
Venue | Paper Abbr | Project |
---|---|---|
CVPR'20 | DELO | N/A |
ACCV'20 | SU | Code |
AAAI'20 | GTNet | Code |
CVPR'22 | RRFS | Code |
Zero-Shot Segmentation
Zero-Shot Semantic Segmentation
Visual-Semantic Space Mapping
Venue | Paper Abbr | Project |
---|---|---|
CVPR'19 | SPNet | Code |
NeurIPS'20 | ULZSS | Code |
ICCV'21 | JoEm | Code |
ICCVW'19 | VM | N/A |
ICCV'21 | PMOSR | N/A |
Novel Visual Feature Synthesis
Venue | Paper Abbr | Project |
---|---|---|
NeurIPS'19 | ZS3Net | Code |
NeurIPS'20 | CSRL | N/A |
MM'20 | CaGNet | Code |
ICCV'21 | SIGN | Code |
Zero-Shot Instance Segmentation
Venue | Paper Abbr | Project |
---|---|---|
CVPR'21 | ZSIS | Code |
Open-Vocabulary Object Detection
Region-Aware Training
Venue | Paper Abbr | Project | Text Encoder | Prompt | Image Backbone (w/ init. method) | Detector |
---|---|---|---|---|---|---|
CVPR'21 | OVR-CNN | Code | BERT | ❌ | R50 (IN-1K) | Faster R-CNN |
GCPR'22 | LocOv | Code | BERT | ❌ | R50 (IN-1K) | Faster R-CNN |
arXiv'23 | MMC-Det | N/A | BERT | ❌ | R50 (N/A) | Faster R-CNN/CenterNetv2 |
NeurIPS'22 | DetCLIP | N/A | FILIP | T (cat+def) | Swin | ATSS |
CVPR'23 | DetCLIPv2 | N/A | FILIP | T (cat+def) | Swin | ATSS |
CVPR'24 | DetCLIPv3 | N/A | FILIP | T (cat+def) | Swin | DETR-like |
AAAI'24 | WSOVOD | Code | CLIP | T (cat) | R50 (IN-1K) | Faster R-CNN |
CVPR'23 | RO-ViT | N/A | CLIP | T (cat) | ViT (ALIGN) | Mask R-CNN |
ICCV'23 | CFM-ViT | N/A | CLIP | T (cat) | ViT (ALIGN) | Mask R-CNN |
ICCV'23 | DITO | Code | CLIP | T (cat) | ViT (CLIP, ALIGN, DataComp-1B) | Faster R-CNN |
ICLR'23 | VLDet | Code | CLIP | T (cat) | R50 (IN-1K) | Faster R-CNN/CenterNetv2 |
ICCV'23 | GOAT | N/A | CLIP | T (cat) | R50 (IN-1K/RegionCLIP) | Faster R-CNN/CenterNetv2 |
ECCV'22 | OV-DETR | Code | CLIP | T (cat) | R50 (N/A) | Def-DETR |
arXiv'23 | Prompt-OVD | N/A | CLIP | T (cat) | ViTDet (IN-1K) | Def-DETR |
CVPR'23 | CORA | N/A | CLIP | T (cat) | R50 (N/A) | SAM-DETR/CenterNetv2 |
ICCV'23 | EdaDet | Code | CLIP | T (cat) | | |
ICCV'21 | MDETR | Code | | | | |
ECCV'22 | MAVL | Code | | | | |
NeurIPS'23 | MQ-Det | Code | | | | |
CVPR'24 | YOLO-World | Code | | | | |
MM'23 | SGDN | N/A | RoBERTa | ❌ | | |
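The Prompt column is not explained in the table itself. Going by the Todo item above, `T` plausibly denotes a hand-written template prompt and `L` a learnable one, with `cat` the category name and `def`/`desc` an attached definition or description. Under that reading (an assumption, not something this list states), a template prompt could be assembled as in the sketch below. The templates and the `build_prompts` helper are hypothetical and are not taken from any listed method.

```python
from typing import List, Optional

# Illustrative templates only; each method defines its own set.
TEMPLATES = ["a photo of a {}.", "there is a {} in the scene."]


def build_prompts(category: str, definition: Optional[str] = None) -> List[str]:
    """Fill every template with the category name, optionally followed by a definition."""
    phrase = category if definition is None else f"{category}, {definition}"
    return [template.format(phrase) for template in TEMPLATES]


cat_only = build_prompts("zebra")                                 # assumed "T (cat)" style
cat_def = build_prompts("zebra", "an African equine with stripes")  # assumed "T (cat+def)" style
```

Each resulting string would then be passed to the text encoder named in the Text Encoder column. A learnable prompt would replace the fixed template words with trainable context vectors, which this sketch does not cover.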
Pseudo-Labeling
Venue | Paper Abbr | Project | Text Encoder | Prompt |
---|---|---|---|---|
CVPR'22 | RegionCLIP | Code | CLIP | T (cat) |
ECCV'22 | VL-PLM | Code | | |
CVPR'22 | GLIP | Code | | |
NeurIPS'22 | GLIPv2 | Code | | |
arXiv'23 | Grounding-DINO | Code | | |
ECCV'22 | PromptDet | Code | CLIP | L (cat+desc) |
arXiv'23 | SAS-Det | Code | CLIP | T (cat) |
ECCV'22 | PB-OVD | Code | CLIP | T (cat) |
AAAI'24 | CLIM | Code | CLIP | T (cat) |
arXiv'22 | VTP-OVD | N/A | CLIP | T (cat) |
AAAI'24 | ProxyDet | Code | CLIP | T (cat) |
NeurIPS'23 | CoDet | Code | CLIP | T (cat) |
ECCV'22 | Detic | Code | CLIP | T (cat) |
ICML'23 | MMC | Code | CLIP | GPT-3 |
arXiv'23 | 3Ways | N/A | CLIP | T (cat) |
arXiv'23 | PLAC | N/A | CLIP | T (cat) |
arXiv'23 | PCL | N/A | | |
NeurIPS'23 | OWLv2 | Code | | |
Knowledge Distillation
Venue | Paper Abbr | Project | Text Encoder | Prompt |
---|---|---|---|---|
ICLR'22 | ViLD | Code | CLIP | T (cat) |
ICDMW'22 | ZSD-YOLO | Code | CLIP | T (cat+desc) |
WACV'24 | LP-OVOD | Code | CLIP | T (cat) |
arXiv'23 | EZSD | Code | CLIP | T (cat) |
AAAI'24 | SIC-CADS | Code | CLIP | T (cat) |
CVPR'23 | BARON | Code | CLIP | T (cat) |
CVPR'23 | OADP | Code | CLIP | T (cat) |
arXiv'23 | GridCLIP | N/A | | |
NeurIPS'22 | RKDWTF | Code | CLIP | T (cat) |
ICCV'23 | DK-DETR | Code | CLIP | T (cat) |
CVPR'22 | HierKD | Code | CLIP | T (cat/desc) |
CVPR'22 | DetPro | Code | CLIP | L (cat) |
arXiv'23 | CLIPSelf | Code | CLIP | T (cat) |
Transfer Learning
Venue | Paper Abbr | Project | Text Encoder | Prompt |
---|---|---|---|---|
ECCV'22 | OWL-ViT | Code | CLIP | T (cat) |
CVPR'23 | UniDetector | Code | | |
ICLR'23 | F-VLM | Code | CLIP | T (cat) |
CVPR'23 | ScaleDet | N/A | | |
ICCV'23 | OpenSeed | Code | | |
arXiv'23 | DRR | N/A | CLIP | T (cat) |
arXiv'23 | Sambor | Code | | |
Open-Vocabulary Segmentation
Open-Vocabulary Semantic Segmentation
Region-Aware Training
Venue | Paper Abbr | Project |
---|---|---|
ECCV'22 | OpenSeg | N/A |
arXiv'23 | SLIC | N/A |
CVPR'22 | GroupViT | Code |
ECCV'22 | ViL-Seg | N/A |
ICML'23 | SegCLIP | Code |
CVPR'23 | OVSegmentor | Code |
CVPR'23 | PACL | N/A |
CVPR'23 | TCL | Code |
CVPR'23 | SimSeg | Code |
Pseudo-Labeling
Venue | Paper Abbr | Project |
---|---|---|
ECCV'22 | TTD | N/A |
Knowledge Distillation
Venue | Paper Abbr | Project |
---|---|---|
arXiv'23 | GKC | N/A |
arXiv'23 | SAM-CLIP | N/A |
ICCV'23 | ZeroSeg | Code |
Transfer Learning
Venue | Paper Abbr | Project |
---|---|---|
ICLR'22 | LSeg | Code |
CVPR'23 | SAZS | Code |
MM'23 | CEL | N/A |
CVPR'22 | ZegFormer | Code |
NeurIPS'22 | ReCo | Project |
arXiv'23 | SCAN | N/A |
ECCV'22 | ZSSeg | Code |
ECCV'22 | MaskCLIP | Code |
arXiv'23 | CLIP-DINOiser | Code |
PRCV'23 | MVP-SEG | N/A |
arXiv'23 | OVDiff | Project |
WACV'24 | FOSSIL | N/A |
NeurIPS'23 | POMP | Code |
NeurIPS'23 | AttrSeg | N/A |
arXiv'23 | PnP-OVSS | Code |
arXiv'23 | TagAlign | Project |
arXiv'23 | SelfSeg | N/A |
CVPR'22 | DenseCLIP | Code |
CVPR'23 | OVSeg | Code |
arXiv'23 | CAT-Seg | Code |
arXiv'23 | SED | Code |
NeurIPS'23 | MAFT | Code |
arXiv'23 | TagCLIP | N/A |
CVPR'23 | ZegCLIP | Code |
CVPR'22 | CLIPSeg | Code |
CVPR'23 | SAN | Code |
arXiv'23 | CLIP Surgery | Code |
arXiv'23 | CaR | Project |
Open-Vocabulary Instance Segmentation
Region-Aware Training
Venue | Paper Abbr | Project |
---|---|---|
ICCV'23 | CGG | Code |
CVPR'23 | D2Zero | Code |
Pseudo-Labeling
Venue | Paper Abbr | Project |
---|---|---|
CVPR'22 | XPM | Code |
CVPR'23 | Mask-free OVIS | Code |
arXiv'23 | MosaicFusion | Code |
Knowledge Distillation
Venue | Paper Abbr | Project |
---|---|---|
arXiv'24 | OV-SAM | Code |
Open-Vocabulary Panoptic Segmentation
Region-Aware Training
Venue | Paper Abbr | Project |
---|---|---|
arXiv'24 | Uni-OVSeg | Code |
CVPR'23 | X-Decoder | Code |
CVPR'24 | APE | Code |
Knowledge Distillation
Venue | Paper Abbr | Project |
---|---|---|
CVPR'23 | PADing | Code |
Transfer Learning
Venue | Paper Abbr | Project |
---|---|---|
NeurIPS'23 | FC-CLIP | Code |
CVPR'23 | FreeSeg | Project |
arXiv'24 | PosSAM | Project |
ICCV'23 | MasQCLIP | Project |
CVPR'24 | OMG-Seg | Code |
arXiv'23 | Semantic-SAM | Code |
CVPR'23 | ODISE | Code |
NeurIPS'23 | HIPIE | Code |
ICML'23 | MaskCLIP | Project |
ICCV'23 | OPSNet | N/A |
Open-Vocabulary 3D Scene Understanding
Open-Vocabulary 3D Detection
Venue | Paper Abbr | Project |
---|---|---|
CVPR'23 | OV-3DET | Code |
AAAI'24 | FM-OV3D | Code |
arXiv'23 | OpenSight | N/A |
NeurIPS'23 | CoDA | Code |
arXiv'23 | L3Det | N/A |
Open-Vocabulary 3D Segmentation
Open-Vocabulary 3D Semantic Segmentation
Venue | Paper Abbr | Project |
---|---|---|
arXiv'21 | SeCondPoint | N/A |
3DV'21 | 3DGenZ | Code |
CVPR'23 | OpenScene | Project |
CVPR'23 | PLA | Code |
arXiv'23 | RegionPLC | Project |
Open-Vocabulary 3D Instance Segmentation
Venue | Paper Abbr | Project |
---|---|---|
NeurIPS'23 | OpenMask3D | Project |
CVPR'24 | MaskClustering | Project |
arXiv'23 | OpenIns3D | Project |
arXiv'23 | Open3DIS | Project |
Open-Vocabulary Video Understanding
Open-Vocabulary Video Instance Segmentation
Venue | Paper Abbr | Project |
---|---|---|
ICCV'23 | OV2Seg | Code |
arXiv'23 | OpenVIS | Code |
arXiv'24 | BriVIS | Code |
Acknowledgement
If you find our survey helpful, please consider citing our paper:
@article{survey-ovd-ovs,
title={A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future},
author={Chaoyang Zhu and Long Chen},
journal={arXiv preprint arXiv:2307.09220},
year={2023}
}