Awesome-Temporal-Language-Grounding-in-Videos
A curated list of work on grounding natural language in video and related areas. :-)
Awesome-Temporal-Sentence-Grounding-in-Videos
A curated list of Temporal Sentence Grounding in Videos papers and benchmarks.
The task is also usually referred to as:
- Single Video Moment Retrieval (SVMR)
- Temporal Activity Localization via Language Query (TALL)
- Natural Language Grounding in Videos.
Task definition:
a) Given an untrimmed video and a language query, the video grounding task aims to localize a temporal moment (ts,te) in the video that matches the query.
b-d) Represent a high-level overview of common multi-modality interaction schemes investigated in the literature.
00 - Table of Contents
- 01 - Datasets
- 02 - Benchmark Results
- 03 - Papers
  - Analysis and Surveys
  - Early works - 2017 - 2018 - 2019 - 2020 - 2021
01 - Datasets
Videos Statistics
| Dataset | Features | Videos (Train) | Videos (Val) | Videos (Test) | Avg. (Minutes) | Total (Hours) |
|---|---|---|---|---|---|---|
|  |  | 75 | 27 | 25 | 4.78 | 10.1 |
|  |  | 5336 | 0 | 1334 | 0.50 | 57.1 |
|  |  | 8511 | 1094 | 1037 | 0.50 | 88.7 |
|  |  | 10009 | 4917 (val1) | 5044 | 1.96 | 487.6 |
| MAD | CLIP | 488 | 50 | 112 | 110.77 | 1207.3 |
Sentences Statistics
| Dataset | Features | Queries (Train) | Queries (Val) | Queries (Test) | Avg. | Total (Millions) |
|---|---|---|---|---|---|---|
|  |  | 10146 | 4589 | 4083 | 10.5 | 0.2 |
|  |  | 12404 | 0 | 3720 | 7.2 | 0.1 |
|  |  | 33005 | 4180 | 4021 | 8.0 | 0.3 |
|  |  | 37421 | 17505 (val1) | ? | 14.8 | 1.0 |
| MAD | CLIP | 280183 | 32064 | 72044 | 12.7 | 5.0 |
Language Statistics - (Unique tokens)
| Dataset | Adjectives | Nouns | Verbs | Vocabulary |
|---|---|---|---|---|
|  | 0.2 K | 0.9 K | 0.6 K | 2.3 K |
|  | 0.1 K | 0.6 K | 0.4 K | 1.3 K |
|  | 0.6 K | 4.1 K | 1.9 K | 7.5 K |
|  | 1.1 K | 7.4 K | 3.7 K | 15.4 K |
|  | 5.3 K | 35.5 K | 13.1 K | 61.4 K |
02 - Benchmark Results
- Evaluation metric: Recall@k for IoU=m (link).
- NOTE: For ActivityNet Captions, val1, val2, or a combination of the two splits is used for evaluation. The most common choice is to use val1 as the validation set and val2 as the test set. This is necessary because the official test set is withheld for competition purposes.
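As a rough illustration of the metric above, here is a minimal Python sketch of temporal IoU and Recall@k at IoU=m. It is not taken from any of the listed papers: the function names, the `(start, end)` tuple layout in seconds, and the use of `>=` for the threshold comparison are all assumptions made for this example, and individual benchmarks may differ in details.

```python
from typing import Sequence, Tuple

Moment = Tuple[float, float]  # hypothetical layout: (start, end) in seconds


def temporal_iou(pred: Moment, gt: Moment) -> float:
    """Intersection over union of two temporal segments."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_k(
    ranked_preds: Sequence[Sequence[Moment]],
    gts: Sequence[Moment],
    k: int,
    iou_threshold: float,
) -> float:
    """Percentage of queries with a top-k prediction whose IoU >= threshold.

    ranked_preds[i] holds the predictions for query i, best first.
    The >= comparison is an assumption of this sketch.
    """
    if not gts:
        return 0.0
    hits = 0
    for preds, gt in zip(ranked_preds, gts):
        if any(temporal_iou(p, gt) >= iou_threshold for p in preds[:k]):
            hits += 1
    return 100.0 * hits / len(gts)


# Toy data: two queries, two ranked predictions each.
predictions = [
    [(4.0, 12.0), (5.0, 15.0)],    # query 1: the top-1 prediction overlaps well
    [(20.0, 30.0), (40.0, 50.0)],  # query 2: only the second prediction overlaps
]
ground_truth = [(4.0, 14.0), (42.0, 50.0)]

r1 = recall_at_k(predictions, ground_truth, k=1, iou_threshold=0.5)
r5 = recall_at_k(predictions, ground_truth, k=5, iou_threshold=0.5)
print(f"R@1 IoU0.5 = {r1:.2f}, R@5 IoU0.5 = {r5:.2f}")
```

On this toy data, only the first query is hit at k=1, while both are hit once the second-ranked prediction is allowed, which is why R@5 is never lower than R@1 in the tables below.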
Methods are classified as:
- FS: Fully supervised
- WS: Weakly supervised
- RL: Reinforcement Learning
Format
* `Model` [ID](link) | `Features` | R@k IoU=m |...| R@k IoU=m | Method |
Click the paper ID to jump to the paper details (link to PDF, venue, year, authors, and link to the GitHub repo).
ActivityNet Captions (val 1)
Models | Features | R@1 IoU0.3 | R@1 IoU0.5 | R@1 IoU0.7 | R@5 IoU0.3 | R@5 IoU0.5 | R@5 IoU0.7 | Method |
---|---|---|---|---|---|---|---|---|
ACRN [12] | C3D | 31.29 | 16.17 | - | - | - | - | FS |
A2C [19] | C3D | - | 36.90 | - | - | - | - | RL |
DEBUG [27] | C3D | 55.91 | 39.72 | - | - | - | - | FS |
ExCL [28] | I3D | 63.00 | 43.60 | 23.60 | - | - | - | FS |
TSP-PRL [37] | C3D | 56.08 | 38.76 | - | - | - | - | RL |
GDP [40] | C3D | 56.17 | 39.27 | - | - | - | - | FS |
DRN [41] | C3D | - | 42.49 | 22.25 | - | 71.85 | 45.96 | FS |
VSLNet [48] | I3D | 63.16 | 43.22 | 26.16 | - | - | - | FS |
ActivityNet Captions (val 2)
Models | Features | R@1 IoU0.3 | R@1 IoU0.5 | R@1 IoU0.7 | R@5 IoU0.3 | R@5 IoU0.5 | R@5 IoU0.7 | Method |
---|---|---|---|---|---|---|---|---|
CTRL [6] | C3D | 47.43 | 29.01 | - | 75.32 | 59.17 | - | FS |
TGN [10] | C3D<br>VGG16<br>Inception-V4 | 43.81<br>42.24<br>45.51 | 27.93<br>23.90<br>28.47 | 11.86<br>-<br>- | 54.56<br>51.82<br>57.32 | 44.20<br>40.17<br>43.33 | 24.84<br>-<br>- | FS |
QSPN [17] | C3D | 52.12 | 33.26 | - | 77.72 | 62.39 | - | FS |
WSDEC-W [26] |  | 62.7 | 42.00 | 23.3 | - | - | - | WS |
WSLLN [26] |  | 75.4 | 42.80 | 22.7 | - | - | - | WS |
CMIN [29] | C3D | 64.41 | 44.62 | 24.48 | 82.39 | 69.66 | 52.96 | FS |
2D-TAN (pool) [38] | C3D | 59.45 | 44.51 | 26.54 | 85.53 | 77.13 | 61.96 | FS |
2D-TAN (conv) [38] | C3D | 58.75 | 44.05 | 27.38 | 85.65 | 76.65 | 62.26 | FS |
SCN [39] | C3D | 47.23 | 29.22 | - | 71.45 | 55.69 | - | WS |
DRN [41] | C3D | - | 45.45 | 24.36 | - | 77.97 | 50.30 | FS |
HVTG [45] | OBJ | 57.60 | 40.15 | 18.27 | - | - | - | FS |
PMI [46] | C3D | 59.69 | 38.28 | 17.83 | - | - | - | FS |
DPIN [54] | C3D | 62.40 | 47.27 | 28.31 | 87.52 | 77.45 | 60.03 | FS |
FIAN [55] | C3D | 64.10 | 47.90 | 29.81 | 87.59 | 77.64 | 59.66 | FS |
CSMGAN [56] | C3D | 68.52 | 49.11 | 29.15 | 87.68 | 77.43 | 59.63 | FS |
SMRN [58] | C3D | - | 42.97 | 26.79 | - | 76.46 | 60.51 | FS |
VLG-Net [67] | C3D | - | 46.32 | 29.82 | - | 77.15 | 63.33 | FS |
ActivityNet Captions (val 1 + val 2)
Models | Features | R@1 IoU0.3 | R@1 IoU0.5 | R@1 IoU0.7 | R@5 IoU0.3 | R@5 IoU0.5 | R@5 IoU0.7 | Method |
---|---|---|---|---|---|---|---|---|
QSPN [17] | C3D | 45.30 | 27.70 | 13.60 | 75.70 | 59.20 | 38.30 | FS |
ABLR [20] | C3D | 55.67 | 36.79 | - | - | - | - | RL |
SCDM [25] | C3D | 54.80 | 36.75 | 19.86 | 77.29 | 64.99 | 41.53 | FS |
CBP [36] | C3D | 54.30 | 35.76 | 17.80 | 77.63 | 65.89 | 46.20 | FS |
LGI [43] | C3D | 58.52 | 41.51 | 23.07 | - | - | - | FS |
TripNet [47] | C3D | 48.42 | 32.19 | 13.93 | - | - | - | RL |
TMLGA [49] | I3D | 51.28 | 33.04 | 19.26 | - | - | - | FS |
TACoS (test)
Models | Features | R@1 IoU0.1 | R@1 IoU0.3 | R@1 IoU0.5 | R@1 IoU0.7 | R@5 IoU0.1 | R@5 IoU0.3 | R@5 IoU0.5 | R@5 IoU0.7 | Method |
---|---|---|---|---|---|---|---|---|---|---|
CTRL [6] | C3D | 24.32 | 18.32 | 13.30 | - | 48.73 | 36.69 | 25.42 | - | FS |
TGN [10] | C3D | 41.87 | 21.77 | 18.90 | 11.88 | 53.40 | 39.06 | 31.02 | 15.26 | FS |
ACRN [12] | C3D | 24.22 | 19.52 | 14.62 | - | 47.42 | 34.97 | 24.88 | - | FS |
MCF [13] | C3D | 25.84 | 18.64 | 12.53 | - | 52.96 | 37.13 | 24.73 | - | FS |
ROLE [14] | C3D | 20.37 | 15.38 | 9.94 | - | 45.45 | 31.17 | 20.13 | - | FS |
VAL [15] | C3D | 25.74 | 19.76 | 14.74 | - | 51.87 | 38.55 | 26.52 | - | FS |
QSPN [17] | C3D | 25.31 | 20.15 | 15.23 | - | 53.21 | 36.72 | 25.30 | - | FS |
ABLR [20] | C3D | 34.70 | 19.50 | 9.40 | - | - | - | - | - | FS |
SAP [21] | VGG16 | 31.15 | - | 18.24 | - | 53.51 | - | 28.11 | - | FS |
SMRL [24] | VGG16 | 26.51 | 20.25 | 15.95 | - | 50.01 | 38.47 | 27.84 | - | RL |
SCDM [25] | C3D | - | 26.11 | 21.17 | - | - | 40.16 | 32.18 | - | FS |
DEBUG [27] | C3D | 41.15 | 23.45 | 11.72 | - | - | - | - | - | FS |
ExCL [28] | I3D | - | 45.50 | 28.00 | 13.80 | - | - | - | - | FS |
CMIN [29] | C3D<br>I3D | 36.88<br>41.73 | 27.33<br>32.35 | 19.57<br>22.54 | -<br>- | 64.93<br>69.15 | 43.35<br>50.75 | 28.53<br>32.11 | -<br>- | FS |
SLTA [31] | C3D+FRCNN | 23.13 | 17.07 | 11.92 | - | 46.52 | 32.90 | 20.86 | - | FS |
ACL-K [32] | C3D | 31.64 | 24.17 | 20.01 | - | 57.85 | 42.15 | 30.66 | - | FS |
CBP [36] | C3D | - | 27.31 | 24.79 | 19.10 | - | 43.64 | 37.40 | 25.59 | FS |
2D-TAN (pool) [38] | C3D | 47.59 | 37.29 | 25.32 | - | 70.31 | 57.81 | 45.04 | - | FS |
2D-TAN (conv) [38] | C3D | 46.44 | 35.22 | 25.19 | - | 74.43 | 56.94 | 44.21 | - | FS |
GDP [40] | C3D | 39.68 | 24.14 | 13.50 | - | - | - | - | - | FS |
DRN [41] | C3D | - | - | 23.17 | - | - | - | 33.36 | - | FS |
TripNet [47] | C3D | - | 23.95 | 19.17 | 9.52 | - | - | - | - | RL |
VSLNet [48] | I3D | 29.61 | 24.27 | 20.03 | - | - | - | - | - | FS |
TMLGA [49] | I3D | - | 24.54 | 21.65 | 16.46 | - | - | - | - | FS |
DPIN [54] | C3D | 59.04 | 46.74 | 32.92 | - | 75.78 | 62.16 | 50.26 | - | FS |
FIAN [55] | C3D | 39.55 | 33.87 | 28.58 | - | 56.14 | 47.76 | 39.16 | - | FS |
CSMGAN [56] | C3D | 42.74 | 33.90 | 27.09 | - | 68.97 | 53.98 | 41.22 | - | FS |
SMRN [58] | C3D | 50.44 | 42.49 | 32.07 | - | 77.28 | 66.63 | 52.84 | - | FS |
LGN [64] | C3D | 52.46 | 41.71 | 30.57 | - | 76.86 | 63.06 | 50.76 | - | FS |
VLG-Net [67] | C3D | 57.21 | 45.46 | 34.19 | - | 81.80 | 70.38 | 56.56 | - | FS |
DiDeMo (test)
Models | Features | R@1 IoU0.5 | R@1 IoU0.7 | R@1 IoU1.0 | R@5 IoU0.5 | R@5 IoU0.7 | R@5 IoU1.0 | Method |
---|---|---|---|---|---|---|---|---|
MCN [5] | VGG16<br>Flow<br>VGG16+Flow<br>VGG16+Flow+TEF | -<br>-<br>-<br>- | -<br>-<br>-<br>- | 13.10<br>18.35<br>19.88<br>28.10 | -<br>-<br>-<br>- | -<br>-<br>-<br>- | 44.82<br>56.25<br>62.39<br>78.21 | FS |
TMN [9] | VGG16<br>Flow<br>VGG16+Flow | -<br>-<br>- | -<br>-<br>- | 18.71<br>19.90<br>22.92 | -<br>-<br>- | -<br>-<br>- | 72.97<br>75.14<br>76.08 | FS |
TGN [10] | VGG16<br>Flow<br>VGG16+Flow | -<br>-<br>- | -<br>-<br>- | 24.28<br>27.52<br>28.23 | -<br>-<br>- | -<br>-<br>- | 71.43<br>76.94<br>79.26 | FS |
ACRN [12] | VGG16 | 27.44 | 16.65 | - | 69.43 | 29.45 | - | FS |
ROLE [14] | VGG16 | 29.40 | 15.68 | - | 70.72 | 33.08 | - | FS |
MAN [22] | TAN | - | - | 27.02 | - | - | 81.70 | FS |
TGA [23] | VGG16+Flow | - | - | 12.19 | - | - | 39.74 | WS |
SMRL [24] | VGG16+FRCNN | - | - | 31.06 | - | - | 80.45 | RL |
WSLLN [26] | VGG16<br>Flow | -<br>- | -<br>- | 19.40<br>18.40 | -<br>- | -<br>- | 53.10<br>54.40 | WS |
SLTA [31] | VGG16+FRCNN | 30.92 | 17.16 | - | 70.18 | 33.87 | - | FS |
VLANet [44] | VGG16 | - | - | 19.32 | - | - | 65.68 | WS |
RTBPN [51] | VGG16<br>Flow<br>VGG16+Flow | -<br>-<br>- | -<br>-<br>- | 20.38<br>20.52<br>20.79 | -<br>-<br>- | -<br>-<br>- | 55.88<br>57.72<br>60.26 | WS |
VLG-Net [67] | VGG16 | 33.35 | 25.57 | 25.57 | 88.86 | 71.72 | 71.65 | FS |
LoGAN [69] | VGG16+Flow | - | - | 39.20 | - | - | 64.04 | WS |
Charades-STA (test)
Models | Features | R@1 IoU0.3 | R@1 IoU0.5 | R@1 IoU0.7 | R@5 IoU0.3 | R@5 IoU0.5 | R@5 IoU0.7 | Method |
---|---|---|---|---|---|---|---|---|
CTRL [6] | C3D | - | 23.63 | 8.89 | - | 58.92 | 29.52 | FS |
ACRN [12] | C3D | - | 20.26 | 7.64 | - | 71.99 | 27.79 | FS |
ROLE [14] | C3D | - | 21.74 | 7.82 | - | 70.37 | 30.06 | FS |
VAL [15] | C3D | - | 23.12 | 9.16 | - | 61.26 | 27.98 | FS |
ASST [16] | C3D | - | 42.72 | 24.06 | - | 71.32 | 43.98 | FS |
QSPN [17] | C3D | 54.70 | 35.60 | 15.80 | 95.80 | 79.40 | 45.40 | FS |
ABLR [20] | C3D | - | 24.36 | 9.01 | - | - | - | FS |
SAP [21] | VGG16 | - | 27.42 | 13.36 | - | 66.37 | 38.15 | FS |
MAN [22] | VGG16<br>I3D | -<br>- | 41.24<br>46.53 | 20.54<br>22.72 | -<br>- | 83.21<br>86.23 | 51.85<br>53.72 | FS |
TGA [23] | --- | 32.14 | 19.94 | 8.84 | 56.58 | 65.52 | 33.51 | WS |
SMRL [24] | VGG16 | - | 24.36 | 11.17 | - | 61.25 | 32.08 | RL |
SCDM [25] | I3D | - | 54.44 | 33.43 | - | 74.43 | 58.08 | FS |
DEBUG [27] | C3D | - | 37.39 | 17.69 | - | - | - | FS |
ExCL [28] | I3D | 65.10 | 44.10 | 22.40 | - | - | - | FS |
SLTA [31] | C3D+FRCNN | - | 22.81 | 8.25 | - | 72.39 | 31.46 | FS |
ACL [32] | C3D | - | 26.47 | 11.23 | - | 61.51 | 33.23 | FS |
ACL-K [32] | C3D | - | 30.48 | 12.20 | - | 64.84 | 35.13 | FS |
CBP [36] | C3D | - | 36.80 | 18.87 | - | 70.94 | 50.19 | FS |
TSP-PRL [37] | C3D | - | 37.39 | 17.69 | - | - | - | RL |
TSP-PRL [37] | Two Streams | - | 45.30 | 24.73 | - | - | - | RL |
2D-TAN (pool) [38] | VGG16 | - | 39.70 | 23.31 | - | 80.32 | 51.26 | FS |
2D-TAN (conv) [38] | VGG16 | - | 39.81 | 23.25 | - | 79.33 | 52.15 | FS |
SCN [39] | C3D | 42.96 | 23.58 | 9.97 | 95.56 | 71.80 | 38.87 | WS |
GDP [40] | C3D | - | 39.47 | 18.49 | - | - | - | FS |
DRN [41] | VGG16<br>C3D<br>I3D | -<br>-<br>- | 42.90<br>45.40<br>53.09 | 23.68<br>26.40<br>31.75 | -<br>-<br>- | 87.80<br>88.01<br>89.06 | 54.87<br>55.38<br>60.05 | FS |
LGI [43] | I3D | - | 59.46 | 35.48 | - | - | - | FS |
VLANet [44] | C3D | - | 31.83 | 14.17 | - | 82.85 | 33.09 | WS |
HVTG [45] | FRCNN | - | 47.27 | 23.30 | - | - | - | FS |
PMI [46] | C3D | - | 39.73 | 19.27 | - | - | - | FS |
TripNet [47] | C3D | 51.33 | 38.29 | 16.07 | - | - | - | RL |
VSLNet [48] | I3D | - | 54.19 | 35.22 | - | - | - | FS |
TMLGA [49] | I3D | 67.53 | 52.02 | 33.74 | - | - | - | FS |
RTBPN [51] | C3D | 60.04 | 32.36 | 13.24 | 97.48 | 71.85 | 41.18 | WS |
DPIN [54] | VGG16 | - | 47.98 | 26.96 | - | 85.53 | 55.00 | FS |
FIAN [55] | I3D | - | 58.55 | 37.72 | - | 87.80 | 63.52 | FS |
WSTG [61] | --- | 39.80 | 27.30 | 12.90 | - | - | - | WS |
LGN [64] | VGG16 | - | 48.15 | 26.67 | - | 86.80 | 53.01 | FS |
LoGAN [69] | C3D | - | 34.68 | 14.54 | - | 74.30 | 39.11 | WS |
<!-- | AVMR [53] | ResNet | 77.72 | 54.59 | - | 88.92 | 72.78 | - | -->
03 - Papers
Markdown format:
* `ID` | `Model Acronym` | `Conference` | [Paper Name](link) | Author 1 et al | [GitHub](link)
Analysis and Surveys
ID | Model | Venue | Title | Authors | Code |
---|---|---|---|---|---|
| - | -- | BMVC 2020 | Uncovering Hidden Challenges in Query-Based Video Moment Retrieval | Otani et al | |
| - | -- | AAAI 2022 | A Closer Look at Temporal Sentence Grounding in Videos: Datasets and Metrics | Yuan et al | GitHub |
| - | -- | ArXiv 2021 | A Survey on Temporal Sentence Grounding in Videos | Lan et al | |
| - | -- | ArXiv 2022 | The Elements of Temporal Sentence Grounding in Videos: A Survey and Future Directions | Zhang et al | |
| - | -- | ArXiv | A Survey on Natural Language Video Localization | Liu et al | |
Early works
ID | Model | Venue | Title | Authors | Code |
---|---|---|---|---|---|
1 | -- | ACL 2013 | Grounded Language Learning from Video Described with Sentences | Yu et al | |
2 | -- | CVPR 2014 | Visual Semantic Search: Retrieving Videos via Complex Textual Queries | Lin et al | |
3 | -- | AAAI 2015 | Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a Unified Framework | Xu et al | |
4 | -- | IJCAI 2016 | Unsupervised Alignment of Actions in Video with Text Descriptions | Song et al | |
2017
ID | Model | Venue | Title | Authors | Code |
---|---|---|---|---|---|
5 | MCN | ICCV | Localizing Moments in Video with Natural Language | Hendricks et al | GitHub |
6 | CTRL | ICCV | TALL: Temporal Activity Localization via Language Query | Gao et al | GitHub |
7 | -- | ArXiv | Where to Play: Retrieval of Video Segments using Natural-Language Queries | Lee et al | |
2018
ID | Model | Venue | Title | Authors | Code |
---|---|---|---|---|---|
8 | FIFO | ECCV | Find and Focus: Retrieve and Localize Video Events with Natural Language Queries | Shao et al | |
9 | TMN | ECCV | Temporal Modular Networks for Retrieving Complex Compositional Activities in Videos | Liu et al | |
10 | TGN | EMNLP | Temporally Grounding Natural Sentence in Video | Chen et al | GitHub |
11 | TEMPO | EMNLP | Localizing Moments in Video with Temporal Language | Hendricks et al | GitHub |
12 | ACRN | SIGIR | Attentive Moment Retrieval in Videos | Liu et al | GitHub |
13 | MCF | IJCAI | Multi-modal Circulant Fusion for Video-to-Language and Backward | Wu et al | GitHub |
14 | ROLE | ACM MM | Cross-modal Moment Localization in Videos | Liu et al | GitHub |
15 | VAL | PRCM | VAL: Visual-attention action localizer | Song et al | |
16 | ASST | ArXiv | Attentive Sequence to Sequence Translation for Localizing Clips of Interest by Natural Language Descriptions | Ning et al | |
2019
2020
2021
2022
Licenses
To the extent possible under law, muketong has waived all copyright and related or neighboring rights to this work.