Awesome-Temporal-Sentence-Grounding-in-Videos

A curated list of Temporal Sentence Grounding in Videos papers and benchmarks.
The task is also commonly referred to as:

Single Video Moment Retrieval (SVMR)
Temporal Activity Localization via Language Query (TALL)
Natural Language Grounding in Videos

Task definition:

a) Given an untrimmed video and a language query, the video grounding task aims to localize a temporal moment (t_s,t_e) in the video that matches the query.

b-d) High-level overview of common multi-modality interaction schemes investigated in the literature.

00 - Table of Contents

01 - Datasets
02 - Benchmark Results
03 - Papers
- Analysis and Surveys
- Early works - 2017 - 2018 - 2019 - 2020 - 2021 - 2022

01 - Datasets

Videos Statistics

Dataset	Features (Download)	Videos (Train)	Videos (Val)	Videos (Test)	Avg. Duration (Minutes)	Total Duration (Hours)
TACoS	C3D	-	-	-	4.78	10.1
Charades-STA	VGG16, I3D (LGI), I3D (DRN)	5336	-	1334	0.50	57.1
DiDeMo	VGG16	8511	1094	1037	0.50	88.7
ActivityNet Captions	C3D	10009	4917 (val1), 4885 (val2)	5044	1.96	487.6
MAD	CLIP	488	-	112	110.77	1207.3

Sentences Statistics

Dataset	Features (Download)	Queries (Train)	Queries (Val)	Queries (Test)	Avg. Tokens	Total Tokens (Millions)
TACoS	-	10146	4589	4083	10.5	0.2
Charades-STA	-	12404	-	3720	7.2	0.1
DiDeMo	-	33005	4180	4021	8.0	0.3
ActivityNet Captions	-	37421	17505 (val1), 17031 (val2)	-	14.8	1.0
MAD	CLIP	280183	32064	72044	12.7	5.0

Language Statistics - (Unique tokens)

Dataset	Adjectives	Nouns	Verbs	Vocabulary
TACoS	0.2 K	0.9 K	0.6 K	2.3 K
Charades-STA	0.1 K	0.6 K	0.4 K	1.3 K
DiDeMo	0.6 K	4.1 K	1.9 K	7.5 K
ActivityNet Captions	1.1 K	7.4 K	3.7 K	15.4 K
MAD	5.3 K	35.5 K	13.1 K	61.4 K

02 - Benchmark Results

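Evaluation metric: Recall@k for IoU=m (link), i.e. the percentage of queries for which at least one of the top-k predicted moments overlaps the ground-truth moment with a temporal IoU of at least m.
NOTE: For ActivityNet Captions, val1, val2, or a combination of the two splits is used for evaluation. The most common choice is to use val1 as the validation set and val2 as the test set. This is necessary because the official test set is withheld for competition purposes.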
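
As a rough illustration of how R@k IoU=m can be computed, here is a minimal Python sketch. The function names (`temporal_iou`, `recall_at_k`) are hypothetical and not taken from any listed repository. The sketch assumes one ground-truth moment per query, predictions already sorted by confidence, and a hit whenever IoU >= m; individual papers may differ in these details.

```python
from typing import Sequence, Tuple

# A moment is a (t_s, t_e) pair in seconds, with t_s <= t_e.
Moment = Tuple[float, float]


def temporal_iou(pred: Moment, gt: Moment) -> float:
    """Intersection over union of two temporal segments."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_k(
    ranked_preds: Sequence[Sequence[Moment]],
    gts: Sequence[Moment],
    k: int,
    m: float,
) -> float:
    """Percentage of queries with at least one top-k prediction at IoU >= m.

    Assumption: ranked_preds[i] is sorted from most to least confident.
    """
    hits = 0
    for preds, gt in zip(ranked_preds, gts):
        if any(temporal_iou(p, gt) >= m for p in preds[:k]):
            hits += 1
    return 100.0 * hits / len(gts)


# Toy example with two queries and two ranked predictions each.
preds = [
    [(0.0, 10.0), (5.0, 15.0)],    # query 1
    [(20.0, 30.0), (40.0, 50.0)],  # query 2
]
gts = [(4.0, 14.0), (22.0, 28.0)]

print(recall_at_k(preds, gts, k=1, m=0.5))  # only query 2 is a hit at rank 1
print(recall_at_k(preds, gts, k=5, m=0.5))  # both queries are hit within top-5
```

With k=1 the first query misses because its top prediction has an IoU of about 0.43, while the second query hits with an IoU of 0.6. Raising k lets the second-ranked prediction of the first query (IoU of about 0.82) count.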

Methods are classified as:

FS: Fully supervised
WS: Weakly supervised
RL: Reinforcement Learning

Format

* `Model` [ID](link) | `Features` |  R@k IoU=m |...| R@k IoU=m | Method |

Click the paper ID to jump to the paper details (link to PDF, venue, year, authors, and link to the GitHub repo).

ActivityNet Captions (val 1)

Models	Features	R@1 IoU0.3	R@1 IoU0.5	R@1 IoU0.7	R@5 IoU0.3	R@5 IoU0.5	R@5 IoU0.7	Method
ACRN [12]	C3D	31.29	16.17	-	-	-	-	FS
A2C [19]	C3D	-	36.90	-	-	-	-	RL
DEBUG [27]	C3D	55.91	39.72	-	-	-	-	FS
ExCL [28]	I3D	63.00	43.60	23.60	-	-	-	FS
TSP-PRL [37]	C3D	56.08	38.76	-	-	-	-	RL
GDP [40]	C3D	56.17	39.27	-	-	-	-	FS
DRN [41]	C3D	-	42.49	22.25	-	71.85	45.96	FS
VSLNet [48]	I3D	63.16	43.22	26.16	-	-	-	FS

ActivityNet Captions (val 2)

Models	Features	R@1 IoU0.3	R@1 IoU0.5	R@1 IoU0.7	R@5 IoU0.3	R@5 IoU0.5	R@5 IoU0.7	Method
CTRL [6]	C3D	47.43	29.01	-	75.32	59.17	-	FS
TGN [10]	C3D	43.81	27.93	11.86	54.56	44.20	24.84	FS
TGN [10]	VGG16	42.24	23.90	-	51.82	40.17	-	FS
TGN [10]	Inception-V4	45.51	28.47	-	57.32	43.33	-	FS
QSPN [17]	C3D	52.12	33.26	-	77.72	62.39	-	FS
WSDEC-W [26]		62.7	42.00	23.3	-	-	-	WS
WSLLN [26]		75.4	42.80	22.7	-	-	-	WS
CMIN [29]	C3D	64.41	44.62	24.48	82.39	69.66	52.96	FS
2D-TAN (pool) [38]	C3D	59.45	44.51	26.54	85.53	77.13	61.96	FS
2D-TAN (conv) [38]	C3D	58.75	44.05	27.38	85.65	76.65	62.26	FS
SCN [39]	C3D	47.23	29.22	-	71.45	55.69	-	WS
DRN [41]	C3D	-	45.45	24.36	-	77.97	50.30	FS
HVTG [45]	OBJ	57.60	40.15	18.27	-	-	-	FS
PMI [46]	C3D	59.69	38.28	17.83	-	-	-	FS
DPIN [54]	C3D	62.40	47.27	28.31	87.52	77.45	60.03	FS
FIAN [55]	C3D	64.10	47.90	29.81	87.59	77.64	59.66	FS
CSMGAN [56]	C3D	68.52	49.11	29.15	87.68	77.43	59.63	FS
SMRN [58]	C3D	-	42.97	26.79	-	76.46	60.51	FS
VLG-Net [67]	C3D	-	46.32	29.82	-	77.15	63.33	FS

ActivityNet Captions (val 1 + val 2)

Models	Features	R@1 IoU0.3	R@1 IoU0.5	R@1 IoU0.7	R@5 IoU0.3	R@5 IoU0.5	R@5 IoU0.7	Method
QSPN [17]	C3D	45.30	27.70	13.60	75.70	59.20	38.30	FS
ABLR [20]	C3D	55.67	36.79	-	-	-	-	FS
SCDM [25]	C3D	54.80	36.75	19.86	77.29	64.99	41.53	FS
CBP [36]	C3D	54.30	35.76	17.80	77.63	65.89	46.20	FS
LGI [43]	C3D	58.52	41.51	23.07	-	-	-	FS
TripNet [47]	C3D	48.42	32.19	13.93	-	-	-	RL
TMLGA [49]	I3D	51.28	33.04	19.26	-	-	-	FS

TACoS (test)

Models	Features	R@1 IoU0.1	R@1 IoU0.3	R@1 IoU0.5	R@1 IoU0.7	R@5 IoU0.1	R@5 IoU0.3	R@5 IoU0.5	R@5 IoU0.7	Method
CTRL [6]	C3D	24.32	18.32	13.30	-	48.73	36.69	25.42	-	FS
TGN [10]	C3D	41.87	21.77	18.90	11.88	53.40	39.06	31.02	15.26	FS
ACRN [12]	C3D	24.22	19.52	14.62	-	47.42	34.97	24.88	-	FS
MCF [13]	C3D	25.84	18.64	12.53	-	52.96	37.13	24.73	-	FS
ROLE [14]	C3D	20.37	15.38	9.94	-	45.45	31.17	20.13	-	FS
VAL [15]	C3D	25.74	19.76	14.74	-	51.87	38.55	26.52	-	FS
QSPN [17]	C3D	25.31	20.15	15.23	-	53.21	36.72	25.30	-	FS
ABLR [20]	C3D	34.70	19.50	9.40	-	-	-	-	-	FS
SAP [21]	VGG16	31.15	-	18.24	-	53.51	-	28.11	-	FS
SMRL [24]	VGG16	26.51	20.25	15.95	-	50.01	38.47	27.84	-	RL
SCDM [25]	C3D	-	26.11	21.17	-	-	40.16	32.18	-	FS
DEBUG [27]	C3D	41.15	23.45	11.72	-	-	-	-	-	FS
ExCL [28]	I3D	-	45.50	28.00	13.80	-	-	-	-	FS
CMIN [29]	C3D	36.88	27.33	19.57	-	64.93	43.35	28.53	-	FS
CMIN [29]	I3D	41.73	32.35	22.54	-	69.15	50.75	32.11	-	FS
SLTA [31]	C3D + FRCNN	23.13	17.07	11.92	-	46.52	32.90	20.86	-	FS
ACL-K [32]	C3D	31.64	24.17	20.01	-	57.85	42.15	30.66	-	FS
CBP [36]	C3D	-	27.31	24.79	19.10	-	43.64	37.40	25.59	FS
2D-TAN (pool) [38]	C3D	47.59	37.29	25.32	-	70.31	57.81	45.04	-	FS
2D-TAN (conv) [38]	C3D	46.44	35.22	25.19	-	74.43	56.94	44.21	-	FS
GDP [40]	C3D	39.68	24.14	13.50	-	-	-	-	-	FS
DRN [41]	C3D	-	-	23.17	-	-	-	33.36	-	FS
TripNet [47]	C3D	-	23.95	19.17	9.52	-	-	-	-	RL
VSLNet [48]	I3D	29.61	24.27	20.03	-	-	-	-	-	FS
TMLGA [49]	I3D	-	24.54	21.65	16.46	-	-	-	-	FS
DPIN [54]	C3D	59.04	46.74	32.92	-	75.78	62.16	50.26	-	FS
FIAN [55]	C3D	39.55	33.87	28.58	-	56.14	47.76	39.16	-	FS
CSMGAN [56]	C3D	42.74	33.90	27.09	-	68.97	53.98	41.22	-	FS
SMRN [58]	C3D	50.44	42.49	32.07	-	77.28	66.63	52.84	-	FS
LGN [63]	C3D	52.46	41.71	30.57	-	76.86	63.06	50.76	-	FS
VLG-Net [67]	C3D	57.21	45.46	34.19	-	81.80	70.38	56.56	-	FS

DiDeMo (test)

Models	Features	R@1 IoU0.5	R@1 IoU0.7	R@1 IoU1.0	R@5 IoU0.5	R@5 IoU0.7	R@5 IoU1.0	Method
MCN [5]	VGG16	-	-	13.10	-	-	44.82	FS
MCN [5]	Flow	-	-	18.35	-	-	56.25	FS
MCN [5]	VGG16+Flow	-	-	19.88	-	-	62.39	FS
MCN [5]	VGG16+Flow+TEF	-	-	28.10	-	-	78.21	FS
TMN [9]	VGG16	-	-	18.71	-	-	72.97	FS
TMN [9]	Flow	-	-	19.90	-	-	75.14	FS
TMN [9]	VGG16+Flow	-	-	22.92	-	-	76.08	FS
TGN [10]	VGG16	-	-	24.28	-	-	71.43	FS
TGN [10]	Flow	-	-	27.52	-	-	76.94	FS
TGN [10]	VGG16+Flow	-	-	28.23	-	-	79.26	FS
ACRN [12]	VGG16	27.44	16.65	-	69.43	29.45	-	FS
ROLE [14]	VGG16	29.40	15.68	-	70.72	33.08	-	FS
MAN [22]	TAN	-	-	27.02	-	-	81.70	FS
TGA [23]	VGG16+Flow	-	-	12.19	-	-	39.74	WS
SMRL [24]	VGG16+FRCNN	-	-	31.06	-	-	80.45	RL
WSLLN [26]	VGG16	-	-	19.40	-	-	53.10	WS
WSLLN [26]	Flow	-	-	18.40	-	-	54.40	WS
SLTA [31]	VGG16+FRCNN	30.92	17.16	-	70.18	33.87	-	FS
VLANet [44]	VGG16	-	-	19.32	-	-	65.68	WS
RTBPN [51]	VGG16	-	-	20.38	-	-	55.88	WS
RTBPN [51]	Flow	-	-	20.52	-	-	57.72	WS
RTBPN [51]	VGG16+Flow	-	-	20.79	-	-	60.26	WS
VLG-Net [67]	VGG16	33.35	25.57	25.57	88.86	71.72	71.65	FS
LoGAN [68]	VGG16+Flow	-	-	39.20	-	-	64.04	WS

Charades-STA (test)

Models	Features	R@1 IoU0.3	R@1 IoU0.5	R@1 IoU0.7	R@5 IoU0.3	R@5 IoU0.5	R@5 IoU0.7	Method
CTRL [6]	C3D	-	23.63	8.89	-	58.92	29.52	FS
ACRN [12]	C3D	-	20.26	7.64	-	71.99	27.79	FS
ROLE [14]	C3D	-	21.74	7.82	-	70.37	30.06	FS
VAL [15]	C3D	-	23.12	9.16	-	61.26	27.98	FS
ASST [16]	C3D	-	42.72	24.06	-	71.32	43.98	FS
QSPN [17]	C3D	54.70	35.60	15.80	95.80	79.40	45.40	FS
ABLR [20]	C3D	-	24.36	9.01	-	-	-	FS
SAP [21]	VGG16	-	27.42	13.36	-	66.37	38.15	FS
MAN [22]	VGG16	-	41.24	20.54	-	83.21	51.85	FS
MAN [22]	I3D	-	46.53	22.72	-	86.23	53.72	FS
TGA [23]	---	32.14	19.94	8.84	56.58	65.52	33.51	WS
SMRL [24]	VGG16	-	24.36	11.17	-	61.25	32.08	RL
SCDM [25]	I3D	-	54.44	33.43	-	74.43	58.08	FS
DEBUG [27]	C3D	-	37.39	17.69	-	-	-	FS
ExCL [28]	I3D	65.10	44.10	22.40	-	-	-	FS
SLTA [31]	C3D+FRCNN	-	22.81	8.25	-	72.39	31.46	FS
ACL [32]	C3D	-	26.47	11.23	-	61.51	33.23	FS
ACL-K [32]	C3D	-	30.48	12.20	-	64.84	35.13	FS
CBP [36]	C3D	-	36.80	18.87	-	70.94	50.19	FS
TSP-PRL [37]	C3D	-	37.39	17.69	-	-	-	RL
TSP-PRL [37]	Two Streams	-	45.30	24.73	-	-	-	RL
2D-TAN (pool) [38]	VGG16	-	39.70	23.31	-	80.32	51.26	FS
2D-TAN (conv) [38]	VGG16	-	39.81	23.25	-	79.33	52.15	FS
SCN [39]	C3D	42.96	23.58	9.97	95.56	71.80	38.87	WS
GDP [40]	C3D	-	39.47	18.49	-	-	-	FS
DRN [41]	VGG16	-	42.90	23.68	-	87.80	54.87	FS
DRN [41]	C3D	-	45.40	26.40	-	88.01	55.38	FS
DRN [41]	I3D	-	53.09	31.75	-	89.06	60.05	FS
LGI [43]	I3D	-	59.46	35.48	-	-	-	FS
VLANet [44]	C3D	-	31.83	14.17	-	82.85	33.09	WS
HVTG [45]	FRCNN	-	47.27	23.30	-	-	-	FS
PMI [46]	C3D	-	39.73	19.27	-	-	-	FS
TripNet [47]	C3D	51.33	38.29	16.07	-	-	-	RL
VSLNet [48]	I3D	-	54.19	35.22	-	-	-	FS
TMLGA [49]	I3D	67.53	52.02	33.74	-	-	-	FS
RTBPN [51]	C3D	60.04	32.36	13.24	97.48	71.85	41.18	WS
DPIN [54]	VGG16	-	47.98	26.96	-	85.53	55.00	FS
FIAN [55]	I3D	-	58.55	37.72	-	87.80	63.52	FS
WSTG [61]	---	39.80	27.30	12.90	-	-	-	WS
LGN [63]	VGG16	-	48.15	26.67	-	86.80	53.01	FS
LoGAN [68]	C3D	-	34.68	14.54	-	74.30	39.11	WS

03 - Papers

Markdown format:

* `ID` | `Model Acronym` | `Conference` | [Paper Name](link) | Author 1 et al |  [GitHub](link)

Analysis and Surveys

ID	Model	Venue	Title	Authors	Code
-	`--`	`BMVC 2020`	Uncovering Hidden Challenges in Query-Based Video Moment Retrieval	Otani et al
-	`--`	`AAAI 2022`	A Closer Look at Temporal Sentence Grounding in Videos: Datasets and Metrics	Yuan et al	GitHub
-	`--`	`ArXiv 2021`	A Survey on Temporal Sentence Grounding in Videos	Lan et al
-	`--`	`ArXiv 2022`	The Elements of Temporal Sentence Grounding in Videos: A Survey and Future Directions	Zhang et al
-	`--`	`ArXiv`	A Survey on Natural Language Video Localization	Liu et al

Early works

ID	Model	Venue	Title	Authors
1	`--`	`ACL 2013`	Grounded Language Learning from Video Described with Sentences	Yu et al
2	`--`	`CVPR 2014`	Visual Semantic Search: Retrieving Videos via Complex Textual Queries	Lin et al
3	`--`	`AAAI 2015`	Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a Unified Framework	Xu et al
4	`--`	`IJCAI 2016`	Unsupervised Alignment of Actions in Video with Text Descriptions	Song et al

2017

ID	Model	Venue	Title	Authors	Code
5	`MCN`	`ICCV`	Localizing Moments in Video with Natural Language	Hendricks et al	GitHub
6	`CTRL`	`ICCV`	TALL: Temporal Activity Localization via Language Query	Gao et al	GitHub
7	`--`	`ArXiv`	Where to Play: Retrieval of Video Segments using Natural-Language Queries	Lee et al

2018

ID	Model	Venue	Title	Authors	Code
8	`FIFO`	`ECCV`	Find and Focus: Retrieve and Localize Video Events with Natural Language Queries	Shao et al
9	`TMN`	`ECCV`	Temporal Modular Networks for Retrieving Complex Compositional Activities in Videos	Liu et al
10	`TGN`	`EMNLP`	Temporally Grounding Natural Sentence in Video	Chen et al	GitHub
11	`TEMPO`	`EMNLP`	Localizing Moments in Video with Temporal Language	Hendricks et al	GitHub
12	`ACRN`	`SIGIR`	Attentive Moment Retrieval in Videos	Liu et al	GitHub
13	`MCF`	`IJCAI`	Multi-modal Circulant Fusion for Video-to-Language and Backward	Wu et al	GitHub
14	`ROLE`	`ACM MM`	Cross-modal Moment Localization in Videos	Liu et al	GitHub
15	`VAL`	`PCM`	VAL: Visual-attention action localizer	Song et al
16	`ASST`	`ArXiv`	Attentive Sequence to Sequence Translation for Localizing Clips of Interest by Natural Language Descriptions	Ning et al

2019

ID	Model	Venue	Title	Authors	Code
17	`QSPN`	`AAAI`	Multilevel Language and Vision Integration for Text-to-Clip Retrieval	Xu et al	GitHub
18	`LNet`	`AAAI`	Localizing Natural Language in Videos	Chen et al
19	`A2C`	`AAAI`	Read, Watch, and Move: Reinforcement Learning for Temporally Grounding Natural Language Descriptions in Videos	He et al	GitHub
20	`ABLR`	`AAAI`	To Find Where You Talk: Temporal Sentence Localization in Video with Attention Based Location Regression	Yuan et al	GitHub
21	`SAP`	`AAAI`	Semantic Proposal for Activity Localization in Videos via Sentence Query	Chen et al
22	`MAN`	`CVPR`	MAN: Moment Alignment Network for Natural Language Moment Retrieval via Iterative Graph Adjustment	Zhang et al	GitHub
23	`TGA`	`CVPR`	Weakly Supervised Video Moment Retrieval From Text Queries	Mithun et al	GitHub
24	`SMRL`	`CVPR`	Language-Driven Temporal Activity Localization: A Semantic Matching Reinforcement Learning Model	Wang et al
25	`SCDM`	`NIPS`	Semantic Conditioned Dynamic Modulation for Temporal Sentence Grounding in Videos	Yuan et al	GitHub
26	`WSLLN`	`EMNLP`	WSLLN: Weakly Supervised Natural Language Localization Networks	Gao et al
27	`DEBUG`	`EMNLP`	DEBUG: A Dense Bottom-Up Grounding Approach for Natural Language Video Localization	Lu et al
28	`ExCL`	`NAACL`	ExCL: Extractive Clip Localization Using Natural Language Descriptions	Ghosh et al
29	`CMIN`	`SIGIR`	Cross-Modal Interaction Networks for Query-Based Moment Retrieval in Videos	Zhang et al	GitHub
30	`CMIN`	`IEEE`	Moment Retrieval via Cross-Modal Interaction Networks With Query Reconstruction	Zhang et al	GitHub
31	`SLTA`	`ICMR`	Cross-Modal Video Moment Retrieval with Spatial and Language-Temporal Attention	Jiang et al	GitHub
32	`ACL`	`WACV`	MAC: Mining Activity Concepts for Language-based Temporal Localization	Ge et al	GitHub
33	`WSSTG`	`ACL`	Weakly-Supervised Spatio-Temporally Grounding Natural Sentence in Video	Chen et al	GitHub
34	`TCMN`	`ACM`	Exploiting Temporal Relationships in Video Moment Localization with Natural Language	Zhang et al	GitHub
35	`CAL`	`ArXiv`	Temporal Localization of Moments in Video Collections with Natural Language	Escorcia et al	GitHub

2020

ID	Model	Venue	Title	Authors	Code
36	`CBP`	`AAAI`	Temporally Grounding Language Queries in Videos by Contextual Boundary-aware Prediction	Wang et al	GitHub
37	`TSP-PRL`	`AAAI`	Tree-Structured Policy based Progressive Reinforcement Learning for Temporally Language Grounding in Video	Wu et al	GitHub
38	`2D-TAN`	`AAAI`	Learning 2D Temporal Localization Networks for Moment Localization with Natural Language	Zhang et al	GitHub1, GitHub2
39	`SCN`	`AAAI`	Weakly-Supervised Video Moment Retrieval via Semantic Completion Network	Lin et al
40	`GDP`	`AAAI`	Rethinking the Bottom-Up Framework for Query-based Video Localization	Chen et al
41	`DRN`	`CVPR`	Dense Regression Network for Video Grounding	Zeng et al	GitHub
42	`STGRN`	`CVPR`	Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences	Zhang et al	GitHub
43	`LGI`	`CVPR`	Local-Global Video-Text Interactions for Temporal Grounding	Mun et al	GitHub
44	`VLANet`	`ECCV`	VLANet: Video-Language Alignment Network for Weakly-Supervised Video Moment Retrieval	Ma et al
45	`HVTG`	`ECCV`	Hierarchical Visual-Textual Graph for Temporal Activity Localization via Language	Chen et al	GitHub
46	`PMI`	`ECCV`	Learning Modality Interaction for Temporal Sentence Localization and Event Captioning in Videos	Chen et al
47	`TripNet`	`BMVC`	Tripping through time: Efficient Localization of Activities in Videos	Hahn et al
48	`VSLNet`	`ACL`	Span-based Localizing Network for Natural Language Video Localization	Zhang et al	GitHub
49	`TMLGA`	`WACV`	Proposal-free Temporal Moment Localization of a Natural-Language Query in Video using Guided Attention	Rodriguez-Opazo et al	GitHub
50	`--`	`NIPS`	Counterfactual Contrastive Learning for Weakly-Supervised Vision-Language Grounding	Zhang et al
51	`RTBPN`	`ACM`	Regularized Two-Branch Proposal Networks for Weakly-Supervised Moment Retrieval in Videos	Zhang et al
52	`STRONG`	`ACM`	STRONG: Spatio-Temporal Reinforcement Learning for Cross-Modal Video Moment Localization	Cao et al
53	`AVMR`	`ACM`	Adversarial Video Moment Retrieval by Jointly Modeling Ranking and Localization	Cao et al
54	`DPIN`	`ACM`	Dual Path Interaction Network for Video Moment Localization	Wang et al
55	`FIAN`	`ACM`	Fine-grained Iterative Attention Network for Temporal Language Localization in Videos	Qu et al
56	`CSMGAN`	`ACM`	Jointly Cross- and Self-Modal Graph Attention Network for Query-Based Moment Localization	Liu et al	GitHub
57	`--`	`DAVU`	Cross-Modality Video Segment Retrieval with Ensemble Learning	Yu et al
58	`SMRN`	`ISNN`	Semantic Modulation Based Residual Network for Temporal Language Queries Grounding in Video	Chen et al
59	`--`	`Journal`	Cross-modal video moment retrieval based on visual-textual relationship alignment	Chen et al
60	`--`	`ArXiv`	Video Moment Retrieval via Natural Language Queries	Yu et al
61	`WSTG`	`ArXiv`	Look Closer to Ground Better: Weakly-Supervised Temporal Grounding of Sentence in Video	Chen et al
62	`MARN`	`ArXiv`	Weakly-Supervised Multi-Level Attentional Reconstruction Network for Grounding Textual Queries in Videos	Song et al
63	`LGN`	`ArXiv`	Language Guided Networks for Cross-modal Moment Retrieval	Liu et al
64	`ACRM`	`ArXiv`	Frame-wise Cross-modal Match for Video Moment Retrieval	Tang et al
65	`CMA`	`ArXiv`	A Simple Yet Effective Method for Video Temporal Grounding with Cross-Modality Attention	Zhang et al
66	`--`	`ArXiv`	Natural Language Video Localization: A Revisit in Span-based Question Answering Framework	Zhang et al

2021

ID	Model	Venue	Title	Authors	Code
67	`VLG-Net`	`ICCVW`	VLG-Net: Video-Language Graph Matching Network for Video Grounding	Soldan et al	GitHub
68	`LoGAN`	`WACV`	LoGAN: Latent Graph Co-Attention Network for Weakly-Supervised Video Moment Retrieval	Tan et al
69	`CBLN`	`CVPR`	Context-aware Biaffine Localizing Network for Temporal Sentence Grounding	Liu et al	GitHub
70	`DeNet`	`CVPR`	Embracing Uncertainty: Decoupling and De-bias for Robust Temporal Grounding	Zhou et al
70	`DORi`	`WACV`	DORi: Discovering Object Relationships for Moment Localization of a Natural Language Query in a Video	Rodriguez-Opazo et al	GitHub
71	`PEARL`	`WACV`	Natural Language Video Moment Localization Through Query-Controlled Temporal Convolution	Zhang et al
72	`IVG-DCL`	`CVPR`	Interventional Video Grounding With Dual Contrastive Learning	Nan et al	GitHub
73	`SMIN`	`CVPR`	Structured Multi-Level Interaction Network for Video Moment Localization via Language Query	Wang et al
74	`--`	`CVPR`	Multi-Stage Aggregated Transformer Network for Temporal Language Localization in Videos	Zhang et al
75	`MMRG`	`CVPR`	Multi-Modal Relational Graph for Cross-Modal Video Moment Retrieval	Zeng et al
76	`CPN`	`CVPR`	Cascaded Prediction Network via Segment Tree for Temporal Video Grounding	Zhao et al
77	`CRM`	`CVPR`	Cross-Sentence Temporal and Semantic Relations in Video Activity Localisation	Huang et al
78	`FVMR`	`CVPR`	Fast Video Moment Retrieval	Gao et al
79	`RMN`	`ACL`	Reasoning Step-by-Step: Temporal Sentence Localization in Videos via Deep Rectification-Modulation Network	Liu et al
80	`--`	`ACL`	Fine-grained Semantic Alignment Network for Weakly Supervised Temporal Language Grounding	Wang et al
81	`VCA`	`ACM`	Visual Co-Occurrence Alignment Learning for Weakly-Supervised Video Moment Retrieval	Wang et al
82	`CI-MHA`	`ACM`	Cross Interaction Network for Natural Language Guided Video Moment Retrieval	Yu et al
83	`MABAN`	`Journal`	MABAN: Multi-Agent Boundary-Aware Network for Natural Language Moment Retrieval	Yu et al
84	`CFSTRI`	`Journal`	Coarse-to-Fine Spatial-Temporal Relationship Inference for Temporal Sentence Grounding	Qi et al
85	`--`	`Journal`	Regularized Two Granularity Loss Function for Weakly Supervised Video Moment Retrieval	Teng et al
86	`ACRM`	`Journal`	Frame-wise Cross-modal Matching for Video Moment Retrieval	Tang et al
87	`DCT-net`	`Journal`	DCT-net: A deep co-interactive transformer network for video temporal grounding	Qi et al
88	`SV-VMR`	`Journal`	Diving Into The Relations: Leveraging Semantic and Visual Structures For Video Moment Retrieval	Wu et al
89	`CAN`	`Journal`	Context-aware network with foreground recalibration for grounding natural language in video	Chen et al
90	`--`	`Journal`	Multi-scale 2D Representation Learning for weakly-supervised moment retrieval	Li et al
91	`LCNet`	`Journal`	Local Correspondence Network for Weakly Supervised Temporal Sentence Grounding	Yang et al
92	`CLEAR`	`Journal`	Coarse-to-Fine Semantic Alignment for Cross-Modal Moment Localization	Hu et al
93	`VSLNet`	`Journal`	Natural Language Video Localization: A Revisit in Span-based Question Answering Framework	Zhang et al
94	`VSRNet`	`Journal`	VSRNet: End-to-end video segment retrieval with text query	Sun et al
95	`MS-2D-TAN`	`Journal`	Multi-Scale 2D Temporal Adjacency Networks for Moment Localization with Natural Language	Zhang et al	GitHub
96	`U-VMR`	`Journal`	Learning Video Moment Retrieval Without a Single Annotated Video	Gao et al
97	`CPNet`	`AAAI`	Proposal-Free Video Grounding with Contextual Pyramid Network	Li et al
98	`DepNet`	`AAAI`	Dense Events Grounding in Video	Bao et al
99	`BPNet`	`AAAI`	Boundary Proposal Network for Two-Stage Natural Language Video Localization	Xiao et al
100	`STVGBert`	`ICCV`	STVGBert: A Visual-Linguistic Transformer Based Framework for Spatio-Temporal Video Grounding	Su et al
101	`BSP`	`ICCV`	Boundary-sensitive Pre-training for Temporal Localization in Videos	Xu et al	GitHub
102	`SSCS`	`ICCV`	Support-Set Based Cross-Supervision for Video Grounding	Ding et al
103	`DCM`	`SIGIR`	Deconfounded Video Moment Retrieval with Causal Intervention	Yang et al	GitHub
104	`--`	`Arxiv`	Video Moment Retrieval with Text Query Considering Many-to-Many Correspondence Using Potentially Relevant Pair	Maeoki et al
105	`HDRR`	`Arxiv`	Hierarchical Deep Residual Reasoning for Temporal Moment Localization	Ma et al	GitHub
106	`RaNet`	`EMNLP`	Relation-aware Video Reading Comprehension for Temporal Language Grounding	Gao et al	GitHub
107	`GTR`	`Arxiv`	On Pursuit of Designing Multi-modal Transformer for Video Grounding	Cao et al
108	`SeqPAN`	`Arxiv`	Parallel Attention Network with Sequence Matching for Video Grounding	Zhang et al
109	`S^4TLG`	`Arxiv`	Self-supervised Learning for Semi-supervised Temporal Language Grounding	Luo et al
110	`IA-Net`	`EMNLP`	Progressively Guide to Attend: An Iterative Alignment Framework for Temporal Sentence Grounding	Liu et al
111	`LPNet`	`Arxiv`	Natural Language Video Localization with Learnable Moment Proposals	Xiao et al
112	`PLN`	`Arxiv`	Progressive Localization Networks for Language-based Moment Localization	Zheng et al
113	`SNEAK`	`Arxiv`	SNEAK: Synonymous Sentences-Aware Adversarial Attack on Natural Language Video Localization	Gou et al
113	`MGSL-Net`	`Arxiv`	Memory-Guided Semantic Learning Network for Temporal Sentence Grounding	Liu et al
114	`MMFA-CF`	`IWACIII`	A Multi-modal Fusion Algorithm for Cross-modal Video Moment Retrieval	Jia et al

2022

ID	Model	Venue	Title	Authors	Code
115	`MARN`	`Arxiv`	Exploring Motion and Appearance Information for Temporal Sentence Grounding	Liu et al
116	`DebiasTLL`	`Arxiv`	Learning Sample Importance for Cross-Scenario Video Temporal Grounding	Bao et al
117	`DebiasTLL`	`Journal`	Video Moment Retrieval With Cross-Modal Neural Architecture Search	Yang et al	GitHub
118	`CDN`	`Journal`	Cross-modal Dynamic Networks for Video Moment Retrieval with Text Query	Yang et al	GitHub
119	`DSCNet`	`AAAI`	Unsupervised Temporal Video Grounding with Deep Semantic Clustering	Liu et al
120	`APGN`	`ACL`	Adaptive Proposal Generation Network for Temporal Sentence Localization in Videos	Liu et al
121	`PLRN`	`AVSS`	Position-aware Location Regression Network for Temporal Video Grounding	Kim et al
122	`PRVG`	`Arxiv`	End-to-End Dense Video Grounding via Parallel Regression	Shi et al
123	`MQEI`	`Journal`	Multi-Level Query Interaction for Temporal Language Grounding	Tang et al
124	`LocFormer`	`Arxiv`	LocFormer: Enabling Transformers to Perform Temporal Moment Localization on Long Untrimmed Videos With a Feature Sampling Approach	Rodriguez-Opazo et al
125	`STCM-Net`	`Journal`	STCM-Net: A symmetrical one-stage network for temporal language localization in videos	Jia et al
126	`TACI`	`Journal`	Learning to combine the modalities of language and video for temporal moment localization	Shin et al
127	`--`	`Arxiv`	Multi-Scale Self-Contrastive Learning with Hard Negative Mining for Weakly-Supervised Query-based Video Grounding	Mo et al
128	`MA3SRN`	`Arxiv`	Exploring Optical-Flow-Guided Motion and Detection-Based Appearance for Temporal Sentence Grounding	Liu et al
129	`--`	`AAAI`	Explore Inter-Contrast Between Videos via Composition for Weakly Supervised Temporal Sentence Grounding	Chen et al
130	`--`	`CVPR`	MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions	Soldan et al	GitHub
131	`--`	`Arxiv`	Compositional Temporal Grounding with Structured Variational Cross-Graph Correspondence Learning	Li et al	GitHub

Licenses

To the extent possible under law, muketong has waived all copyright and related or neighboring rights to this work.
