What do DIV and FLT stand for?
I see there are 3 subsets: DIV, FLT, and the aesthetic version. What are the filtering criteria used for DIV and FLT, and what do they stand for?
DIV and FLT stand for diversity sampling and filtering, respectively. Specifically, for DIV (diversity sampling), we aim to sample video clips from all long videos available to maximize data diversity. This was done by counting the frequencies of long videos in the segmented clip pool and sampling clips with probabilities inverse to these frequencies. For FLT (filtering), we applied a series of filtering strategies to the video data alongside DIV sampling:
a) Removing video clips shorter than 1s (approximately 23.15% of the total) or longer than 120s (around 0.84% of the total).
b) Computing CLIPScore for each video clip using a randomly sampled frame from the clip with OpenAI's CLIP-ViT-L/14, then keeping the clips within the top 30% of CLIPScores.
c) Sampling 10M of the remaining clips using DIV sampling.
For details, see Sec. E.1 of the appendix of this paper.
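To make the DIV and FLT steps described above concrete, here is a minimal sketch. It is not the authors' code. The record fields `video_id`, `duration`, and `clip_score`, the function names, and the choice to sample without replacement are all assumptions made for illustration, and the CLIPScore is assumed to be precomputed.

```python
import random
from collections import Counter


def div_sample(clips, k, seed=0):
    """Pick up to k distinct clips, weighting each by 1 / (clips from its video)."""
    rng = random.Random(seed)
    # How many clips each long video contributed to the pool.
    counts = Counter(c["video_id"] for c in clips)
    # Efraimidis-Spirakis weighted sampling without replacement:
    # draw u ~ U(0, 1), use the key u ** (1 / w), keep the k largest keys.
    # With w = 1 / count, the key simplifies to u ** count.
    keyed = [(rng.random() ** counts[c["video_id"]], i) for i, c in enumerate(clips)]
    keyed.sort(reverse=True)
    return [clips[i] for _, i in keyed[:k]]


def flt_sample(clips, k, top_frac=0.30, min_s=1.0, max_s=120.0, seed=0):
    """Duration filter, then top fraction by CLIPScore, then DIV sampling."""
    # a) Drop clips shorter than min_s or longer than max_s seconds.
    kept = [c for c in clips if min_s <= c["duration"] <= max_s]
    # b) Keep the top `top_frac` of the remaining clips by (precomputed) CLIPScore.
    kept.sort(key=lambda c: c["clip_score"], reverse=True)
    top = kept[: int(len(kept) * top_frac)]
    # c) Sample k of the survivors with inverse-frequency weights.
    return div_sample(top, k, seed)
```

With these weights, every source video carries the same total probability mass regardless of how many clips it was cut into, so a video that produced many clips does not dominate the sample.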
Got it, and thanks for the fast response! 4 follow-ups (the first one is the most important):
- Have you released the JSONL for the full set of 230M clips?
  "After the filtering, we get total 234M video clips whose durations range from 2s to more than 30s."
- Does the aesthetic dataset do any sort of filtering by CLIP score? (I'm guessing not, but wanted to confirm.) Also, how did you determine what a high aesthetic score was? (Top 10%? Above some constant? etc.)
- Is this passage:
  "we aim to sample video clips from all long videos available to maximize data diversity. This was done by counting the frequencies of long videos in the segmented clip pool and sampling clips with probabilities inverse to these frequencies"
  saying "if there are many clips from the same video, we sample those clips less" (presumably in order to avoid oversampling from longer videos)?
- Is there a reason you used CLIPScore with CLIP-ViT-L/14 instead of the UMT_Score when calculating video-caption similarity?
Apologies for the delayed response.
- You can access the full version of InternVid here.
- No. The aesthetic dataset does not consider CLIP score. When filtering by aesthetic score, we set the threshold so that the top 10% of clips are kept.
- Yes. You can review the sampling code snippet here.
- We also computed scores from UMT. We prioritized CLIPScore because it is widely recognized and used, whereas UMT was less well known at the time. Emphasizing UMT scores in the paper could have caused unnecessary confusion among reviewers.
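For the top-10% aesthetic threshold mentioned in the answers above, a percentile cutoff can be derived from the scores themselves. The sketch below is illustrative only: the function name and the tie handling are assumptions, not the authors' implementation.

```python
def top_fraction_threshold(scores, frac=0.10):
    """Return the lowest score that still falls inside the top `frac` of scores."""
    ranked = sorted(scores, reverse=True)
    # Keep at least one item so the threshold is always defined.
    n_keep = max(1, int(len(ranked) * frac))
    return ranked[n_keep - 1]


def keep_top_fraction(scores, frac=0.10):
    """Keep every score at or above the threshold; ties can admit a few extra."""
    threshold = top_fraction_threshold(scores, frac)
    return [s for s in scores if s >= threshold]
```

Deriving the cutoff from a rank rather than a fixed constant means the kept fraction stays roughly the same even if the score distribution shifts between data sources.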