Awesome-Object-Tracking

Survey: A collection of AWESOME papers and resources on the latest research in Object Tracking.


Object Tracking: A Comprehensive Survey From Classical Approaches to Large Vision-Language and Foundation Models


Welcome to the Awesome-Object-Tracking repository!
This repository is a curated collection of the most influential papers, code implementations, benchmarks, and resources related to Object Tracking across Single Object Tracking (SOT), Multi-Object Tracking (MOT), Long-Term Tracking (LTT), and Foundation Model–based Tracking.

Our work is based on the following paper:
📄 Object Tracking: A Comprehensive Survey From Classical Approaches to Large Vision-Language and Foundation Models
Available on TechRxiv, ResearchGate, Preprints, PDF

Authors

Rahul Raja - LinkedIn, Carnegie Mellon University
Arpita Vats - LinkedIn, Boston University, Santa Clara University
Omkar Thawakar - Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE
Tajamul Ashraf - Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE

Feel free to ⭐ star and fork this repository to keep up with the latest advancements and contribute to the community.

If you find our work helpful, please consider citing our survey. Thank you.

@article{202509.2051,
	doi = {10.20944/preprints202509.2051.v1},
	url = {https://doi.org/10.20944/preprints202509.2051.v1},
	year = 2025,
	month = {September},
	publisher = {Preprints},
	author = {Rahul Raja and Arpita Vats and Omkar Thawakar and Tajamul Ashraf},
	title = {Object Tracking: A Comprehensive Survey From Classical Approaches to Large Vision-Language and Foundation Models},
	journal = {Preprints}
}

Taxonomy of object tracking paradigms, spanning historical foundations, single-object tracking (SOT), multi-object tracking (MOT), long-term tracking (LTT), and emerging trends leveraging foundation and vision-language models. Each branch highlights representative methods and architectures across the evolution of tracking research.

Timeline of object tracking research from classical foundations and deep learning, through hybrid and transformer-based trackers, to recent long-term, multi-modal, and foundation/VLM-powered approaches.

📚 Object Tracking Surveys

Title | Task | Year | Link
Object Tracking: A Comprehensive Survey From Classical Approaches to Large Vision-Language and Foundation Models | SOT/MOT/VLM | 2025 | Preprints
Deep Learning-Based Multi-Object Tracking: A Comprehensive Survey from Foundations to State-of-the-Art | MOT | 2025 | arXiv
Multiple object tracking: A literature review | MOT | 2023 | ScienceDirect
Transformers in Single Object Tracking: An Experimental Survey | SOT | 2023 | arXiv
Visual object tracking: A survey | SOT | 2022 | ScienceDirect
A Survey of Long-Term Visual Tracking | LTT | 2022 | IEEE
Single Object Tracking: A Survey of Methods, Datasets, and Evaluation Metrics | SOT | 2022 | arXiv
Deep Learning in Visual Object Tracking: A Review | SOT | 2021 | IEEE
Deep Learning For Visual Tracking: A Comprehensive Survey | SOT | 2021 | arXiv
Deep Learning for Generic Object Detection: A Survey | SOT/MOT | 2019 | Springer
A Survey of Multiple Object Tracking | MOT | 2016 | IEEE
Object tracking: A survey | SOT/MOT | 2006 | ACM

📌 Single Object Tracking (SOT) Models

  • MDNet: Multi-Domain Convolutional Network for Visual Tracking [Paper]
  • GOTURN: Learning to Track at 100 FPS with Deep Regression Networks [Paper]
  • TLD: Tracking-Learning-Detection [Paper]
  • SiamFC: Fully Convolutional Siamese Networks for Object Tracking [Paper]
  • SiamRPN: High Performance Visual Tracking with Siamese Region Proposal Network [Paper]
  • SiamRPN++: Evolution of Siamese Visual Tracking with Very Deep Networks [Paper]
  • TransT: Transformer Tracking [Paper]
  • STARK: Learning Spatio-Temporal Transformer for Visual Tracking [Paper]
  • ToMP: Transforming Model Prediction for Tracking [Paper]
  • ATOM: Accurate Tracking by Overlap Maximization [Paper]
  • DiMP: Learning Discriminative Model Prediction for Tracking [Paper]
  • SiamRCNN: Visual Tracking by Re-detection [Paper]
  • SiamBAN: Siamese Box Adaptive Network for Visual Tracking [Paper]
  • MixFormer: End-to-End Tracking with Iterative Mixed Attention [Paper]
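
The Siamese trackers above (SiamFC, SiamRPN, SiamRPN++, and their descendants) share one core operation: the template (exemplar) features are cross-correlated with the search-region features, and the peak of the resulting response map localizes the target. The snippet below is a minimal, illustrative NumPy sketch of that correlation step under simplified assumptions (a single feature channel, no learned backbone); it is not the implementation of any particular tracker listed here.

import numpy as np

def cross_correlate(template: np.ndarray, search: np.ndarray) -> np.ndarray:
    """Slide the template over the search region and return a response map.

    template: (th, tw) feature patch cropped around the target in the first frame.
    search:   (sh, sw) feature map of the current search region (sh >= th, sw >= tw).
    """
    th, tw = template.shape
    sh, sw = search.shape
    response = np.zeros((sh - th + 1, sw - tw + 1))
    for y in range(response.shape[0]):
        for x in range(response.shape[1]):
            window = search[y:y + th, x:x + tw]
            response[y, x] = float(np.sum(window * template))  # dot-product similarity
    return response

# Toy usage: the response peak gives the target's offset inside the search region.
template = np.random.rand(8, 8)
search = np.random.rand(32, 32)
response = cross_correlate(template, search)
dy, dx = np.unravel_index(np.argmax(response), response.shape)
print("estimated target offset:", dy, dx)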

📌 Multi-Object Tracking (MOT) Models

🟦 Detection-Guided

  • DeepSORT: Simple Online and Realtime Tracking with a Deep Association Metric [Paper]
  • StrongSORT: Strong Baselines for DeepSORT [Paper]
  • Tracktor++: Leveraging the Tracking-by-Detection Paradigm for Object Tracking [Paper]
  • ByteTrack: Multi-Object Tracking by Associating Every Detection Box [Paper]
  • MR2-ByteTrack: Multi-Resolution & Resource-Aware ByteTrack [Paper]
  • LG-Track: Local-Global Association Framework [Paper]
  • Deep LG-Track: Deep Local-Global MOT with Enhanced Features [Paper]
  • RTAT: Robust Two-Stage Association Tracker [Paper]
  • Wu et al. – ACCV MOT Framework [Paper]
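
The detection-guided methods above all follow the tracking-by-detection paradigm: an off-the-shelf detector produces boxes in every frame, and an association step links those boxes to existing tracks (DeepSORT adds appearance embeddings and a Kalman filter, ByteTrack additionally recovers low-confidence detections). The snippet below is a minimal, illustrative sketch of the simplest form of that association step, greedy IoU matching; it is not the algorithm of any specific paper listed here, which typically add Hungarian matching plus motion and appearance cues.

import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(tracks, detections, iou_thresh=0.3):
    """Greedily match track boxes to detection boxes by descending IoU.

    Returns (matches, unmatched_track_ids, unmatched_detection_ids).
    """
    pairs = sorted(
        ((iou(t, d), ti, di) for ti, t in enumerate(tracks) for di, d in enumerate(detections)),
        reverse=True,
    )
    used_t, used_d, matches = set(), set(), []
    for score, ti, di in pairs:
        if score < iou_thresh:
            break  # remaining pairs are even weaker
        if ti in used_t or di in used_d:
            continue
        matches.append((ti, di))
        used_t.add(ti)
        used_d.add(di)
    unmatched_t = [i for i in range(len(tracks)) if i not in used_t]
    unmatched_d = [i for i in range(len(detections)) if i not in used_d]
    return matches, unmatched_t, unmatched_d

# Toy usage on a single frame transition.
tracks = [(10, 10, 50, 50), (100, 100, 150, 160)]
detections = [(12, 11, 52, 49), (300, 300, 340, 360)]
print(associate(tracks, detections))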

🟦 Detection-Integrated

  • FairMOT: On the Fairness of Detection and Re-ID in MOT [Paper]
  • CenterTrack: Tracking Objects as Points [Paper]
  • QDTrack: Quasi-Dense Similarity Learning for MOT [Paper]
  • Speed-FairMOT: Lightweight FairMOT for Real-Time Applications [Paper]
  • TBDQ-Net: Tracking by Detection-Query Efficient Network [Paper]
  • JDTHM: Joint Detection-Tracking with Hierarchical Memory [Paper]

🟦 Transformer-Based

  • TrackFormer: Tracking by Query with Transformer [Paper]
  • TransTrack: Transformer-based MOT with Cross-Frame Attention [Paper]
  • ABQ-Track: Anchor-Based Query Transformer for MOT [Paper]
  • MeMOTR: Memory-Augmented Transformer for MOT [Paper]
  • Co-MOT: Collaborative Transformer for Multi-Object Tracking [Paper]

🟦 Multi-Modal / 3D MOT

  • DS-KCF: Depth-based Scale-adaptive KCF [Paper]
  • OTR: Object Tracking by Reconstruction [Paper]
  • DPANet: Depth-aware Panoptic Association Network [Paper]
  • AB3DMOT: Simple Baseline for 3D MOT [Paper]
  • CenterPoint: Center-based 3D Object Tracking [Paper]
  • RGB-D Tracking: Depth-Based Multi-Object Tracking [Paper]
  • CS Fusion: Multi-Modal 3D MOT with Cross-Sensor Fusion [Paper]

🟦 ReID Aware Methods

  • JDE: Joint Detection and Embedding for Real-Time MOT [Paper]
  • TransReID: Transformer-based Object Re-Identification [Paper]

📌 Long-Term Tracking (LTT) Models

  • TLD: Tracking-Learning-Detection [Paper]
  • DaSiamRPN: Distractor-Aware Siamese RPN for LTT [Paper]
  • SiamRPN++ (LT): Improved Siamese RPN with Global Search [Paper]
  • LTTrack: Occlusion-Aware Long-Term MOT with Zombie Pool Re-activation [Paper]
  • MambaLCT: Memory-Augmented Long-Term Tracking with State-Space Models [Paper]
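
A common thread in the long-term trackers above (TLD, DaSiamRPN, and the long-term SiamRPN++ variant) is a confidence-gated switch between local tracking and global re-detection: when the short-term tracker's score drops below a threshold the target is declared lost, and a wider (often full-frame) search runs until a confident match re-acquires it. Below is a minimal, illustrative sketch of that control loop; track_locally and redetect_globally are hypothetical placeholder callables, not APIs of any listed tracker.

def long_term_track(frames, init_box, track_locally, redetect_globally, conf_thresh=0.5):
    """Confidence-gated local tracking with global re-detection on failure.

    track_locally(frame, prev_box) -> (box, confidence)          # hypothetical short-term tracker
    redetect_globally(frame) -> (box, confidence) or (None, 0.0)  # hypothetical full-frame search
    """
    box, lost = init_box, False
    results = []
    for frame in frames:
        if not lost:
            box, conf = track_locally(frame, box)
            if conf < conf_thresh:
                lost = True  # target considered lost: stop trusting the local search window
        if lost:
            cand, conf = redetect_globally(frame)
            if cand is not None and conf >= conf_thresh:
                box, lost = cand, False  # re-acquire the target and resume local tracking
        results.append(box if not lost else None)
    return results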

📌 Foundation Models, VLM, and Multimodal Tracking

Model | Description | Paper | Code
TrackAnything | Segment and track arbitrary objects in video using SAM-based vision transformers. Provides interactive video editing and annotation capabilities. | Paper | Code
CLDTracker | Chain-of-Language driven tracker that integrates reasoning chains into vision-language tracking. Enables flexible text-guided association for MOT. | Paper | Code
EfficientTAM | Lightweight tracking-anything framework designed for efficiency. Maintains strong performance while reducing compute and memory footprint. | Paper | Code
SAM-PD | Prompt-driven tracking method built on SAM. Allows flexible prompts to initiate and update object trajectories across frames. | Paper | Code
SAM-Track | Combines SAM with DeAOT to segment and track anything in videos. Achieves robust long-term tracking and high-quality segmentation masks. | Paper | Code
SAMURAI | Builds on SAM2 with a memory-gating mechanism for improved temporal stability. Handles long occlusions and challenging re-identifications. | Paper | Code
OVTrack | Open-vocabulary multi-object tracker using CLIP and transformers. Supports free-form text prompts for category-agnostic tracking. | Paper | Code
LaMOTer | Language-Motion Transformer for MOT that fuses linguistic cues with motion features. Improves robustness in ambiguous tracking cases. | Paper | Code
PromptTrack | Prompt-driven tracker designed for autonomous driving. Leverages vision-language prompts to improve adaptability to unseen road objects. | Paper | Code
UniVS | Unified vision and speech multimodal tracker. Processes both audio and video streams for enhanced disambiguation in challenging environments. | Paper | Code
ViPT | Visual prompt tuning framework for object tracking. Introduces learnable prompts for adapting foundation models to tracking tasks. | Paper | Code
MemVLT | Memory-augmented vision-language tracker. Encodes long-term context to maintain identity consistency across occlusions. | Paper | Code
DINOTrack | Builds on DINOv2 for self-supervised tracking. Uses patch-level matching for robust representation without labeled data. | Paper | Code
VIMOT | Vision-language multimodal tracker evaluated on driving datasets. Supports multi-class and open-world tracking scenarios. | Paper | N/A
BLIP-2 | Bootstrapped language-image pretraining model. Serves as a general-purpose vision-language backbone adaptable to tracking. | Paper | Code
GroundingDINO | Open-set object detection with language prompts. Provides strong grounding for vision-language tracking pipelines. | Paper | Code
Flamingo | Large-scale multimodal few-shot learner with frozen LMs. Capable of integrating temporal reasoning across modalities. | Paper | Code
SAM2MOT | Extends SAM2 for segmentation-based multi-object tracking. Targets open-world and promptable tracking challenges. | Paper | Code
DTLLM-VLT | Dynamic tracking with LLM-vision fusion. Incorporates large language models for reasoning over visual tracking states. | Paper | Code
DUTrack | Dynamic update mechanism with language-driven adaptation. Enhances model robustness in evolving visual environments. | Paper | Code
UVLTrack | Unified vision-language tracking across multiple modalities. Provides flexible open-vocab evaluation with diverse prompts. | Paper | Code
All-in-One | Multimodal tracking framework combining vision and language encoders. Offers a versatile baseline for fusion strategies. | Paper | Code
Grounded-SAM | Combines GroundingDINO and SAM for open-vocabulary tracking. Strengthens grounding accuracy for segmentation-driven MOT. | Paper | Code
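
Many of the vision-language trackers above (e.g., OVTrack and the GroundingDINO / Grounded-SAM pipelines) rely on the same primitive: candidate regions and a free-form text prompt are embedded into a shared space, and cosine similarity decides which candidates match the query before the usual association step. The snippet below is a minimal, illustrative NumPy sketch of that prompt-to-region matching; embed_text and embed_regions are hypothetical placeholders standing in for a real vision-language encoder such as CLIP, and the threshold is arbitrary.

import numpy as np

def cosine_similarity(query: np.ndarray, candidates: np.ndarray) -> np.ndarray:
    """Cosine similarity between a query vector (d,) and a candidate matrix (n, d)."""
    query = query / (np.linalg.norm(query) + 1e-9)
    candidates = candidates / (np.linalg.norm(candidates, axis=1, keepdims=True) + 1e-9)
    return candidates @ query

def select_prompted_targets(text_embedding, region_embeddings, sim_thresh=0.25):
    """Keep candidate regions whose embedding matches the text prompt.

    text_embedding:    (d,) vector from a hypothetical embed_text(prompt).
    region_embeddings: (n, d) matrix from a hypothetical embed_regions(frame, boxes).
    Returns indices of regions to initialize or continue tracks for.
    """
    sims = cosine_similarity(text_embedding, region_embeddings)
    return [i for i, s in enumerate(sims) if s >= sim_thresh]

# Toy usage with random embeddings standing in for real encoder outputs.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=512)
region_embs = rng.normal(size=(5, 512))
print(select_prompted_targets(text_emb, region_embs))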

📊 Object Tracking Benchmarks

🔹 Single Object Tracking (SOT)

  • OTB-2013: Online Object Tracking Benchmark [Paper]
  • VOT: Visual Object Tracking Challenge [Paper]
  • LaSOT: Large-scale Single Object Tracking [Paper]
  • TrackingNet: Large-scale Object Tracking Dataset [Paper]
  • GOT-10k: Generic Object Tracking Benchmark [Paper]
  • UAV123: UAV Aerial Tracking Benchmark [Paper]
  • FELT: Long-Term Frame-Event Visual Tracking Benchmark [Paper]
  • NT-VOT211: Night-time Visual Object Tracking [Paper]
  • OOTB: Out-of-Orbit Tracking Benchmark [Paper]
  • GSOT3D: Generalized 3D Object Tracking [Paper]

🔹 Multi-Object Tracking (MOT)

  • MOT15: MOTChallenge 2015 [Paper]
  • MOT17: MOTChallenge 2017 [Paper]
  • MOT20: MOTChallenge 2020 [Paper]
  • KITTI Tracking Benchmark [Paper]
  • BDD100K: Diverse Driving Dataset [Paper]
  • TAO: Tracking Any Object [Paper]
  • DanceTrack: A New Benchmark for Multi-Human Tracking [Paper]
  • EgoTracks: Egocentric MOT [Paper]
  • OVTrack: Open-Vocabulary MOT [Paper]

🔹 Long-Term Tracking (LTT)

  • OxUvA: Oxford Long-Term Tracking Benchmark [Paper]
  • UAV20L: Long-Term UAV Tracking [Paper]
  • LaSOT-Ext: Extended LaSOT Dataset [Paper]
  • TREK-150: Egocentric (First-Person) Visual Object Tracking Benchmark [Paper]

🔹 Vision-Language & Multimodal Benchmarks (VLM)

  • BURST: A Benchmark for Unifying Object Recognition, Segmentation and Tracking in Video [Paper]
  • LVBench: Large-Scale Vision-Language Benchmark [Paper]
  • TNL2K-VLM: Tracking by Natural Language Queries [Paper]

Affiliations