howto100m
Evaluation protocol
Why do “Learning a Text-Video Embedding from Incomplete and Heterogeneous Data” and “HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips” use different evaluation protocols?
Are there two test sets, 1k-A and 1k-B, each consisting of 1,000 randomly sampled text-video pairs?
I am very confused.