Pretrained models performing poorly on dense video captioning
The HowTo100M + VidChapters-7M + ViTT checkpoint performs poorly on dense video captioning: it repeats nearly the same sentence for almost every predicted segment.
Reproduction:
Download this specific video (the URL is quoted so that shells such as zsh do not try to expand the `?`):
yt-dlp -P "$TRANSFORMERS_CACHE" -o video.mp4 "https://www.youtube.com/watch?v=WJPyrQqLTl4"
Then follow the steps in the demo using the HowTo100M + VidChapters-7M + ViTT checkpoint.
Output captions:
[{'sentence': 'Dress up for the ceremony.', 'timestamp': [350.1055353535354, 395.77147474747477]}, {'sentence': 'Take a photo with me.', 'timestamp': [395.77147474747477, 426.21543434343437]}, {'sentence': 'Take a photo with me.', 'timestamp': [426.21543434343437, 456.659393939394]}, {'sentence': 'Take a photo with me.', 'timestamp': [471.88137373737374, 487.10335353535356]}, {'sentence': 'Take a photo with me.', 'timestamp': [487.10335353535356, 517.5473131313131]}, {'sentence': 'Take a picture with me.', 'timestamp': [547.9912727272728, 578.4352323232324]}, {'sentence': 'Take a picture with me.', 'timestamp': [593.6572121212122, 608.879191919192]}, {'sentence': 'Take a picture with me.', 'timestamp': [639.3231515151516, 654.5451313131314]}, {'sentence': 'Take a picture with me.', 'timestamp': [669.7671111111111, 684.9890909090909]}, {'sentence': 'Take a picture with me.', 'timestamp': [684.9890909090909, 700.2110707070708]}, {'sentence': 'Take a picture with me.', 'timestamp': [730.6550303030302, 745.8770101010102]}, {'sentence': 'Take a picture with me.', 'timestamp': [745.8770101010102, 761.0989898989899]}, {'sentence': 'Take a picture with me.', 'timestamp': [791.5429494949495, 806.7649292929293]}, {'sentence': 'Take a picture with me.', 'timestamp': [806.7649292929293, 821.9869090909092]}, {'sentence': 'Take a picture with me.', 'timestamp': [837.208888888889, 852.4308686868687]}, {'sentence': 'Take a picture with me.', 'timestamp': [852.4308686868687, 867.6528484848486]}, {'sentence': 'Take a picture with me.', 'timestamp': [882.8748282828284, 913.318787878788]}, {'sentence': 'Take a picture with me.', 'timestamp': [913.318787878788, 928.5407676767677]}, {'sentence': 'Take a picture with me.', 'timestamp': [928.5407676767677, 943.7627474747475]}, {'sentence': 'Take a picture with me.', 'timestamp': [958.9847272727274, 974.2067070707071]}, {'sentence': 'Take a picture with me.', 'timestamp': [989.4286868686869, 1019.8726464646466]}, {'sentence': 'Take a 
picture with me.', 'timestamp': [1035.0946262626262, 1065.538585858586]}, {'sentence': 'Take a picture with me.', 'timestamp': [1080.7605656565656, 1111.2045252525254]}, {'sentence': 'Take a picture with me.', 'timestamp': [1111.2045252525254, 1141.648484848485]}, {'sentence': 'Take a picture with me.', 'timestamp': [1141.648484848485, 1156.8704646464648]}, {'sentence': 'Take a picture with me.', 'timestamp': [1156.8704646464648, 1172.0924444444445]}, {'sentence': 'Take a picture with me.', 'timestamp': [1172.0924444444445, 1202.536404040404]}, {'sentence': 'Take a picture with me.', 'timestamp': [1202.536404040404, 1217.758383838384]}, {'sentence': 'Take a', 'timestamp': [1217.758383838384, 1232.9803636363638]}]
I see the same issue with the HowTo100M + VidChapters-7M + YouCook2 checkpoint. For this video the model gives these captions:
[{'sentence': 'Take a look at the hood.', 'timestamp': [3.534141414141414, 10.602424242424242]}, {'sentence': 'Take a look at the hood.', 'timestamp': [12.369494949494948, 22.97191919191919]}, {'sentence': 'Take a look at the hood.', 'timestamp': [22.97191919191919, 33.57434343434343]}, {'sentence': 'Take a look at the hood.', 'timestamp': [35.34141414141414, 45.94383838383838]}, {'sentence': 'Take a look at the hood.', 'timestamp': [45.94383838383838, 56.546262626262624]}, {'sentence': 'Take a look at the hood.', 'timestamp': [56.546262626262624, 65.38161616161617]}, {'sentence': 'Take a look at the hood.', 'timestamp': [65.38161616161617, 74.2169696969697]}, {'sentence': 'Take a look at the hood.', 'timestamp': [77.75111111111111, 86.58646464646465]}, {'sentence': 'Take a look at the hood.', 'timestamp': [86.58646464646465, 95.42181818181818]}, {'sentence': 'Take a look at the hood.', 'timestamp': [97.1888888888889, 107.79131313131313]}, {'sentence': 'Take a look at the hood.', 'timestamp': [109.55838383838385, 118.39373737373737]}, {'sentence': 'Take a look at the hood.', 'timestamp': [120.16080808080808, 123.69494949494948]}, {'sentence': 'Take a look at the hood.', 'timestamp': [123.69494949494948, 132.53030303030303]}, {'sentence': 'Take a look at the hood.', 'timestamp': [132.53030303030303, 139.59858585858586]}, {'sentence': 'Take a look at the hood.', 'timestamp': [139.59858585858586, 144.89979797979797]}, {'sentence': 'Take a look at the hood.', 'timestamp': [144.89979797979797, 150.2010101010101]}, {'sentence': 'Take a look at the hood.', 'timestamp': [150.2010101010101, 155.50222222222223]}, {'sentence': 'Take a look at the hood.', 'timestamp': [155.50222222222223, 160.80343434343433]}]
while for this video it gives:
[{'sentence': 'Introduktion.', 'timestamp': [0.0, 3.2334545454545456]}, {'sentence': 'ffningen.', 'timestamp': [3.2334545454545456, 9.700363636363637]}, {'sentence': 'ffningen.', 'timestamp': [9.700363636363637, 21.017454545454545]}, {'sentence': 'ffningen.', 'timestamp': [21.017454545454545, 30.717818181818185]}, {'sentence': 'ffningen.', 'timestamp': [30.717818181818185, 42.03490909090909]}, {'sentence': 'ffningen.', 'timestamp': [42.03490909090909, 51.73527272727273]}, {'sentence': 'ffningen.', 'timestamp': [51.73527272727273, 61.43563636363637]}, {'sentence': 'ffningen.', 'timestamp': [61.43563636363637, 75.98618181818182]}, {'sentence': 'ffningen.', 'timestamp': [75.98618181818182, 87.30327272727274]}, {'sentence': 'ffningen.', 'timestamp': [87.30327272727274, 98.62036363636365]}, {'sentence': 'ffningen.', 'timestamp': [98.62036363636365, 113.17090909090909]}, {'sentence': 'ffningen.', 'timestamp': [113.17090909090909, 127.72145454545455]}, {'sentence': 'ffningen.', 'timestamp': [127.72145454545455, 147.12218181818184]}, {'sentence': 'ffningen.', 'timestamp': [147.12218181818184, 151.97236363636364]}]
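To make the repetition easier to see, here is a minimal sketch that counts distinct sentences and merges runs of consecutive identical captions. It assumes the output is a list of dicts with 'sentence' and 'timestamp' keys, as in the dumps above. The helper name `summarize_captions` is hypothetical and not part of the demo code, and merging across small gaps between segments is a simplification chosen here. The sample is the first three entries of the last output.

```python
from collections import Counter


def summarize_captions(captions):
    """Count distinct sentences and merge runs of identical consecutive captions.

    Hypothetical helper for inspecting the output; not part of the demo code.
    """
    counts = Counter(c["sentence"] for c in captions)
    merged = []
    for c in captions:
        start, end = c["timestamp"]
        if merged and merged[-1]["sentence"] == c["sentence"]:
            # Same sentence as the previous segment: extend its end time
            # (any gap between the two segments is absorbed).
            merged[-1]["timestamp"][1] = end
        else:
            merged.append({"sentence": c["sentence"], "timestamp": [start, end]})
    return counts, merged


# First three entries copied from the last output above.
sample = [
    {"sentence": "Introduktion.", "timestamp": [0.0, 3.2334545454545456]},
    {"sentence": "ffningen.", "timestamp": [3.2334545454545456, 9.700363636363637]},
    {"sentence": "ffningen.", "timestamp": [9.700363636363637, 21.017454545454545]},
]

counts, merged = summarize_captions(sample)
print(f"{len(sample)} segments, {len(counts)} distinct sentences")
for seg in merged:
    print(seg)
```

Pasting a full output list in place of `sample` gives a quick measure of how degenerate the prediction is: a healthy result should have roughly as many distinct sentences as segments.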