Awesome-Object-Tracking
Survey: A collection of AWESOME papers and resources on the latest research in Object Tracking.
Object Tracking: A Comprehensive Survey From Classical Approaches to Large Vision-Language and Foundation Models
Welcome to the Awesome-Object-Tracking repository!
This repository is a curated collection of the most influential papers, code implementations, benchmarks, and resources related to Object Tracking across Single Object Tracking (SOT), Multi-Object Tracking (MOT), Long-Term Tracking (LTT), and Foundation Model-based Tracking.
Our work is based on the following paper:
Object Tracking: A Comprehensive Survey From Classical Approaches to Large Vision-Language and Foundation Models
Available on TechRxiv, ResearchGate, Preprints, PDF
Authors
Rahul Raja - LinkedIn, Carnegie Mellon University
Arpita Vats - LinkedIn, Boston University, Santa Clara University
Omkar Thawakar - Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE
Tajamul Ashraf - Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE
Feel free to star and fork this repository to keep up with the latest advancements and contribute to the community.
If you find our work helpful, please consider citing our survey. Thank you.
@article{202509.2051,
doi = {10.20944/preprints202509.2051.v1},
url = {https://doi.org/10.20944/preprints202509.2051.v1},
year = 2025,
month = {September},
publisher = {Preprints},
author = {Rahul Raja and Arpita Vats and Omkar Thawakar and Tajamul Ashraf},
title = {Object Tracking: A Comprehensive Survey From Classical Approaches to Large Vision-Language and Foundation Models},
journal = {Preprints}
}
Object Tracking Surveys
| Title | Task | Publication Date | Link |
|---|---|---|---|
| Object Tracking: A Comprehensive Survey From Classical Approaches to Large Vision-Language and Foundation Models | SOT/MOT/VLM | 2025 | Preprints |
| Deep Learning-Based Multi-Object Tracking: A Comprehensive Survey from Foundations to State-of-the-Art | MOT | 2025 | arXiv |
| Multiple object tracking: A literature review | MOT | 2023 | ScienceDirect |
| Transformers in Single Object Tracking: An Experimental Survey | SOT | 2023 | arXiv |
| Visual object tracking: A survey | SOT | 2022 | ScienceDirect |
| A Survey of Long-Term Visual Tracking | LTT | 2022 | IEEE |
| Single Object Tracking: A Survey of Methods, Datasets, and Evaluation Metrics | SOT | 2022 | arXiv |
| Deep Learning in Visual Object Tracking: A Review | SOT | 2021 | IEEE |
| Deep Learning For Visual Tracking: A Comprehensive Survey | SOT | 2021 | arXiv |
| Deep Learning for Generic Object Detection: A Survey | SOT/MOT | 2019 | Springer |
| A Survey of Multiple Object Tracking | MOT | 2016 | IEEE |
| Object tracking: A survey | SOT/MOT | 2006 | ACM |
Single Object Tracking (SOT) Models
- MDNet: Multi-Domain Convolutional Network for Visual Tracking [Paper]
- GOTURN: Learning to Track at 100 FPS with Deep Regression Networks [Paper]
- TLD: Tracking-Learning-Detection [Paper]
- SiamFC: Fully Convolutional Siamese Networks for Object Tracking [Paper]
- SiamRPN: High Performance Visual Tracking with Siamese Region Proposal Network [Paper]
- SiamRPN++: Evolution of Siamese Visual Tracking with Very Deep Networks [Paper]
- TransT: Transformer Tracking [Paper]
- STARK: Learning Spatio-Temporal Transformer for Visual Tracking [Paper]
- ToMP: Transforming Model Prediction for Tracking [Paper]
- ATOM: Accurate Tracking by Overlap Maximization [Paper]
- DiMP: Learning Discriminative Model Prediction for Tracking [Paper]
- SiamRCNN: Visual Tracking by Re-Detection [Paper]
- SiamBAN: Siamese Box Adaptive Network for Visual Tracking [Paper]
- MixFormer: End-to-End Tracking with Iterative Mixed Attention [Paper]
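Most of the Siamese trackers above (SiamFC, SiamRPN, and their successors) reduce tracking to one core operation: cross-correlate a template feature map with a larger search-region feature map and take the peak of the resulting response map as the new target location. Below is a minimal sketch of that matching step, using plain grayscale patches in place of learned CNN features; it illustrates the idea rather than the pipeline of any specific tracker.

```python
import numpy as np
from scipy.signal import correlate2d

def siamese_response(template: np.ndarray, search: np.ndarray) -> np.ndarray:
    """Cross-correlate a small template against a larger search region.

    In SiamFC-style trackers both inputs would be CNN feature maps; here we
    use zero-mean grayscale patches purely to illustrate the matching step.
    """
    t = template - template.mean()
    s = search - search.mean()
    # 'valid' keeps only positions where the template fits fully inside the
    # search region, producing the familiar dense response map.
    return correlate2d(s, t, mode="valid")

# Toy usage: locate a 16x16 template inside a 64x64 search region.
rng = np.random.default_rng(0)
search = rng.random((64, 64))
template = search[20:36, 30:46].copy()
response = siamese_response(template, search)
dy, dx = np.unravel_index(response.argmax(), response.shape)
print(f"response peak at (y={dy}, x={dx})")  # expected at (20, 30)
```

SiamRPN-style trackers compute classification and box-regression branches from the same correlation, while the transformer trackers listed above (TransT, STARK, MixFormer) replace explicit correlation with attention between template and search features.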
Multi-Object Tracking (MOT) Models
Detection-Guided
- DeepSORT: Simple Online and Realtime Tracking with a Deep Association Metric [Paper]
- StrongSORT: Make DeepSORT Great Again [Paper]
- Tracktor++: Leveraging the Tracking-by-Detection Paradigm for Object Tracking [Paper]
- ByteTrack: Multi-Object Tracking by Associating Every Detection Box [Paper]
- MR2-ByteTrack: Multi-Resolution & Resource-Aware ByteTrack [Paper]
- LG-Track: Local-Global Association Framework [Paper]
- Deep LG-Track: Deep Local-Global MOT with Enhanced Features [Paper]
- RTAT: Robust Two-Stage Association Tracker [Paper]
- Wu et al. - ACCV MOT Framework [Paper]
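All of the detection-guided ("tracking-by-detection") methods above share the same association core: build a cost matrix between existing tracks and the current frame's detections (IoU, motion, or appearance based) and solve the resulting bipartite matching, usually with the Hungarian algorithm. A minimal, generic sketch of IoU-based association is shown below; it is not the full pipeline of any listed method.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b) -> float:
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(tracks, detections, iou_thresh: float = 0.3):
    """Match track boxes to detection boxes by maximizing total IoU.

    Returns (matches, unmatched_track_ids, unmatched_detection_ids), where
    matches is a list of (track_idx, det_idx) pairs above the IoU threshold.
    """
    if len(tracks) == 0 or len(detections) == 0:
        return [], list(range(len(tracks))), list(range(len(detections)))
    cost = np.array([[1.0 - iou(t, d) for d in detections] for t in tracks])
    rows, cols = linear_sum_assignment(cost)  # Hungarian algorithm
    matches = [(r, c) for r, c in zip(rows, cols) if 1.0 - cost[r, c] >= iou_thresh]
    matched_t = {r for r, _ in matches}
    matched_d = {c for _, c in matches}
    unmatched_t = [i for i in range(len(tracks)) if i not in matched_t]
    unmatched_d = [j for j in range(len(detections)) if j not in matched_d]
    return matches, unmatched_t, unmatched_d
```

ByteTrack's distinguishing idea is to run this association twice, first with high-confidence and then with low-confidence detections, while DeepSORT and StrongSORT replace the pure IoU cost with learned appearance embeddings plus Kalman-filter motion gating.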
Detection-Integrated
- FairMOT: On the Fairness of Detection and Re-ID in MOT [Paper]
- CenterTrack: Objects as Points for Tracking [Paper]
- QDTrack: Quasi-Dense Similarity Learning for MOT [Paper]
- Speed-FairMOT: Lightweight FairMOT for Real-Time Applications [Paper]
- TBDQ-Net: Tracking by Detection-Query Efficient Network [Paper]
- JDTHM: Joint Detection-Tracking with Hierarchical Memory [Paper]
Transformer-Based
- TrackFormer: Tracking by Query with Transformer [Paper]
- TransTrack: Transformer-based MOT with Cross-Frame Attention [Paper]
- ABQ-Track: Anchor-Based Query Transformer for MOT [Paper]
- MeMOTR: Memory-Augmented Transformer for MOT [Paper]
- Co-MOT: Collaborative Transformer for Multi-Object Tracking [Paper]
Multi-Modal / 3D MOT
- DS-KCF: Depth-based Scale-adaptive KCF [Paper]
- OTR: Object Tracking by Reconstruction [Paper]
- DPANet: Depth-aware Panoptic Association Network [Paper]
- AB3DMOT: Simple Baseline for 3D MOT [Paper]
- CenterPoint: Center-based 3D Object Tracking [Paper]
- RGB-D Tracking: Depth-Based Multi-Object Tracking [Paper]
- CS Fusion: Multi-Modal 3D MOT with Cross-Sensor Fusion [Paper]
ReID-Aware Methods
- JDE: Joint Detection and Embedding for Real-Time MOT [Paper]
- TransReID: Transformer-based Object Re-Identification [Paper]
Long-Term Tracking (LTT) Models
- TLD: Tracking-Learning-Detection [Paper]
- DaSiamRPN: Distractor-Aware Siamese RPN for LTT [Paper]
- SiamRPN++ (LT): Improved Siamese RPN with Global Search [Paper]
- LTTrack: Occlusion-Aware Long-Term MOT with Zombie Pool Re-activation [Paper]
- MambaLCT: Memory-Augmented Long-Term Tracking with State-Space Models [Paper]
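What separates long-term trackers from short-term ones is mostly how they handle disappearance: a lost target is kept dormant and re-activated when it is re-detected, rather than being deleted (the "zombie pool" in LTTrack is one published instance of this pattern). The sketch below is a toy illustration of that keep-alive-and-reactivate bookkeeping, not the method of any specific paper; `similarity` is a placeholder for whatever appearance model is used.

```python
from dataclasses import dataclass

@dataclass
class Track:
    track_id: int
    box: tuple          # last known (x1, y1, x2, y2)
    feature: list       # last appearance embedding (placeholder)
    missed: int = 0     # consecutive frames without a match
    active: bool = True

class LongTermTrackManager:
    """Toy long-term bookkeeping: lost tracks go dormant instead of dying
    and can be re-activated when a sufficiently similar detection reappears."""

    def __init__(self, max_missed: int = 30):
        self.max_missed = max_missed
        self.active_tracks: list[Track] = []
        self.dormant_tracks: list[Track] = []   # the keep-alive pool

    def mark_missed(self, track: Track) -> None:
        track.missed += 1
        if track.active and track.missed > self.max_missed:
            track.active = False
            self.active_tracks.remove(track)
            self.dormant_tracks.append(track)   # dormant, not deleted

    def try_reactivate(self, det_feature, similarity, thresh: float = 0.7):
        """Re-activate the most similar dormant track if it passes `thresh`."""
        best, best_sim = None, thresh
        for track in self.dormant_tracks:
            sim = similarity(track.feature, det_feature)
            if sim > best_sim:
                best, best_sim = track, sim
        if best is not None:
            best.active, best.missed = True, 0
            self.dormant_tracks.remove(best)
            self.active_tracks.append(best)
        return best
```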
Foundation Models, VLM, and Multimodal Tracking
| Model | Description | Paper | Code |
|---|---|---|---|
| TrackAnything | Segment and track arbitrary objects in video using SAM-based vision transformers. Provides interactive video editing and annotation capabilities. | Paper | Code |
| CLDTracker | Chain-of-Language driven tracker that integrates reasoning chains into vision-language tracking. Enables flexible text-guided association for MOT. | Paper | Code |
| EfficientTAM | Lightweight tracking-anything framework designed for efficiency. Maintains strong performance while reducing compute and memory footprint. | Paper | Code |
| SAM-PD | Prompt-driven tracking method built on SAM. Allows flexible prompts to initiate and update object trajectories across frames. | Paper | Code |
| SAM-Track | Combines SAM with DeAOT to segment and track anything in videos. Achieves robust long-term tracking and high-quality segmentation masks. | Paper | Code |
| SAMURAI | Builds on SAM2 with a memory-gating mechanism for improved temporal stability. Handles long occlusions and challenging re-identifications. | Paper | Code |
| OVTrack | Open-vocabulary multi-object tracker using CLIP and transformers. Supports free-form text prompts for category-agnostic tracking. | Paper | Code |
| LaMOTer | Language-Motion Transformer for MOT that fuses linguistic cues with motion features. Improves robustness in ambiguous tracking cases. | Paper | Code |
| PromptTrack | Prompt-driven tracker designed for autonomous driving. Leverages vision-language prompts to improve adaptability to unseen road objects. | Paper | Code |
| UniVS | Unified video segmentation model that treats prompts as queries across multiple segmentation and tracking tasks. Adaptable to promptable, category-agnostic tracking in challenging environments. | Paper | Code |
| ViPT | Visual prompt tuning framework for object tracking. Introduces learnable prompts for adapting foundation models to tracking tasks. | Paper | Code |
| MemVLT | Memory-augmented vision-language tracker. Encodes long-term context to maintain identity consistency across occlusions. | Paper | Code |
| DINOTrack | Builds on DINOv2 for self-supervised tracking. Uses patch-level matching for robust representation without labeled data. | Paper | Code |
| VIMOT | Vision-language multimodal tracker evaluated on driving datasets. Supports multi-class and open-world tracking scenarios. | Paper | N/A |
| BLIP-2 | Bootstrapped language-image pretraining model. Serves as a general-purpose vision-language backbone adaptable to tracking. | Paper | Code |
| GroundingDINO | Open-set object detection with language prompts. Provides strong grounding for vision-language tracking pipelines. | Paper | Code |
| Flamingo | Large-scale multimodal few-shot learner with frozen LMs. Capable of integrating temporal reasoning across modalities. | Paper | Code |
| SAM2MOT | Extends SAM2 for segmentation-based multi-object tracking. Targets open-world and promptable tracking challenges. | Paper | Code |
| DTLLM-VLT | Dynamic tracking with LLM-vision fusion. Incorporates large language models for reasoning over visual tracking states. | Paper | Code |
| DUTrack | Dynamic update mechanism with language-driven adaptation. Enhances model robustness in evolving visual environments. | Paper | Code |
| UVLTrack | Unified vision-language tracking across multiple modalities. Provides flexible open-vocab evaluation with diverse prompts. | Paper | Code |
| All-in-One | Multimodal tracking framework combining vision and language encoders. Offers a versatile baseline for fusion strategies. | Paper | Code |
| Grounded-SAM | Combines GroundingDINO and SAM for open-vocabulary tracking. Strengthens grounding accuracy for segmentation-driven MOT. | Paper | Code |
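Despite their differences, many of the vision-language and promptable trackers above follow the same high-level loop: an open-vocabulary detector or segmenter proposes candidates for a text prompt in every frame, and a lightweight association step links them across frames. The sketch below shows only that control flow; `detect_with_prompt` and `embed` are hypothetical callables standing in for a grounded detector (GroundingDINO-style) and an appearance encoder, and do not correspond to the API of any model listed here.

```python
def cosine(a, b) -> float:
    """Cosine similarity between two plain-Python feature vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5) + 1e-9
    return num / den

def track_with_text_prompt(frames, prompt, detect_with_prompt, embed,
                           sim_thresh: float = 0.6):
    """Generic open-vocabulary tracking loop (control flow only).

    detect_with_prompt(frame, prompt) -> list of (box, score)   # hypothetical
    embed(frame, box)                 -> appearance vector       # hypothetical
    """
    tracks = {}      # track_id -> last appearance vector
    next_id = 0
    results = []
    for frame in frames:
        frame_out = []
        for box, score in detect_with_prompt(frame, prompt):
            feat = embed(frame, box)
            # Greedy appearance matching against existing tracks.
            best_id, best_sim = None, sim_thresh
            for tid, prev_feat in tracks.items():
                sim = cosine(feat, prev_feat)
                if sim > best_sim:
                    best_id, best_sim = tid, sim
            if best_id is None:                 # no match: start a new track
                best_id, next_id = next_id, next_id + 1
            tracks[best_id] = feat
            frame_out.append((best_id, box, score))
        results.append(frame_out)
    return results
```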
Object Tracking Benchmarks
Single Object Tracking (SOT)
- OTB-2013: Online Object Tracking Benchmark [Paper]
- VOT: Visual Object Tracking Challenge [Paper]
- LaSOT: Large-scale Single Object Tracking [Paper]
- TrackingNet: Large-scale Object Tracking Dataset [Paper]
- GOT-10k: Generic Object Tracking Benchmark [Paper]
- UAV123: UAV Aerial Tracking Benchmark [Paper]
- FELT: Long-Term Frame-Event Visual Tracking Benchmark [Paper]
- NT-VOT211: Night-time Visual Object Tracking [Paper]
- OOTB: On-Orbit Tracking Benchmark [Paper]
- GSOT3D: Generalized 3D Object Tracking [Paper]
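Most of the SOT benchmarks above (OTB, LaSOT, TrackingNet, GOT-10k) report overlap-based success: per-frame IoU between predicted and ground-truth boxes, the fraction of frames above each IoU threshold, and the area under that curve (AUC). A small sketch of that computation, assuming boxes stored as N x 4 arrays in (x1, y1, x2, y2) format:

```python
import numpy as np

def box_iou(pred: np.ndarray, gt: np.ndarray) -> np.ndarray:
    """Per-frame IoU for N x 4 arrays of (x1, y1, x2, y2) boxes."""
    x1 = np.maximum(pred[:, 0], gt[:, 0])
    y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 2], gt[:, 2])
    y2 = np.minimum(pred[:, 3], gt[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    return inter / (area_p + area_g - inter + 1e-9)

def success_auc(pred: np.ndarray, gt: np.ndarray) -> float:
    """Area under the success-rate curve over 21 IoU thresholds (OTB-style)."""
    ious = box_iou(pred, gt)
    thresholds = np.linspace(0.0, 1.0, 21)
    return float(np.mean([(ious > t).mean() for t in thresholds]))
```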
Multi-Object Tracking (MOT)
- MOT15: MOTChallenge 2015 [Paper]
- MOT17: MOTChallenge 2017 [Paper]
- MOT20: MOTChallenge 2020 [Paper]
- KITTI Tracking Benchmark [Paper]
- BDD100K: Diverse Driving Dataset [Paper]
- TAO: Tracking Any Object [Paper]
- DanceTrack: A New Benchmark for Multi-Human Tracking [Paper]
- EgoTracks: Egocentric MOT [Paper]
- OVTrack: Open-Vocabulary MOT [Paper]
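The MOTChallenge-family benchmarks above are commonly scored with the CLEAR MOT metrics; the headline MOTA number aggregates misses, false positives, and identity switches over the whole sequence. A minimal sketch of the formula, assuming the per-sequence counts have already been produced by a matching step:

```python
def mota(num_gt: int, false_negatives: int, false_positives: int,
         id_switches: int) -> float:
    """CLEAR MOT accuracy: 1 - (FN + FP + IDSW) / total ground-truth boxes."""
    return 1.0 - (false_negatives + false_positives + id_switches) / max(num_gt, 1)

# Toy example: 1000 ground-truth boxes, 80 misses, 40 false positives, 5 ID switches.
print(mota(1000, 80, 40, 5))  # 0.875
```

Newer benchmarks also report IDF1 and HOTA, which weight identity preservation and detection quality differently from MOTA.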
Long-Term Tracking (LTT)
- OxUvA: Oxford Long-Term Tracking Benchmark [Paper]
- UAV20L: Long-Term UAV Tracking [Paper]
- LaSOT-Ext: Extended LaSOT Dataset [Paper]
- TREK-150: First-Person (Egocentric) Vision Tracking Benchmark [Paper]
Vision-Language & Multimodal Benchmarks (VLM)
- BURST: A Benchmark for Unifying Object Recognition, Segmentation and Tracking in Video [Paper]
- LVBench: Large-Scale Vision-Language Benchmark [Paper]
- TNL2K-VLM: Tracking by Natural Language Queries [Paper]