Segment and CLIP-score filtering
I am looking for the part you mention in your paper where you segment the video with the PySceneDetect algorithm and then filter the clips with CLIP.
It doesn't seem to be uploaded to the repository yet. When can I see the code?
Or if you can point me to a reference, I would appreciate it.
> Segment and CLIP-score filtering
>
> As the point tracking system works in a short time window, we begin by using the annotations provided, curated or otherwise, to split each video into segments, and then run PySceneDetect [10] to further break each segment into short video clips with consistent shots. However, the detected video clips may not always be relevant to their associated text annotations. Thus, we use the pretrained CLIP [101] visual and text encoders to compute the cosine similarity score between each clip and text pair, and filter out clips with < 0.25 scores.
Hi,
Thank you very much for your interest in our work! I have uploaded a folder named `video_processing` with some reference scripts. To perform scene segmentation, please take a look at `run_detect_segments.py`. There are explanations in the code, and it should be easy to modify in case you want to adapt it to a different dataset.
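For anyone skimming this thread: at its core, a content-based shot detector of the kind PySceneDetect provides thresholds frame-to-frame content change and cuts wherever the change spikes. Below is a toy sketch of that idea only — `detect_cuts`, `split_into_shots`, and the threshold value are invented for illustration and are not the library's or the repository's actual code:

```python
def detect_cuts(frame_scores, threshold=27.0):
    """Frame indices whose content-change score exceeds the threshold.

    `frame_scores[i]` is a toy stand-in for the visual change between
    frame i-1 and frame i (a real detector computes this from pixels).
    """
    return [i for i, s in enumerate(frame_scores) if s > threshold]


def split_into_shots(num_frames, cut_indices):
    """Turn cut indices into contiguous (start, end) shot spans."""
    bounds = [0] + list(cut_indices) + [num_frames]
    return [(bounds[i], bounds[i + 1])
            for i in range(len(bounds) - 1) if bounds[i] < bounds[i + 1]]


# Example: large jumps at frames 2 and 5 mark shot boundaries.
cuts = detect_cuts([1.0, 5.0, 40.0, 2.0, 3.0, 50.0, 1.0])
shots = split_into_shots(7, cuts)  # → [(0, 2), (2, 5), (5, 7)]
```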
To perform CLIP filtering, please first run `run_clip_filtering.py` and then `curate_list.py`. Again, there are more detailed descriptions in the scripts.
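The CLIP-score filtering described in the paper boils down to thresholding clip-text cosine similarity at 0.25. A minimal sketch of that step with placeholder embeddings (not the repository's actual script; real embeddings come from CLIP's visual and text encoders):

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def filter_clips(clip_embs, text_emb, threshold=0.25):
    """Indices of clips whose CLIP score against the text passes the threshold."""
    return [i for i, emb in enumerate(clip_embs)
            if cosine_similarity(emb, text_emb) >= threshold]


# Placeholder 2-D embeddings; CLIP's encoders would produce these in practice.
text = np.array([1.0, 0.0])
clips = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
print(filter_clips(clips, text))  # → [0, 2]
```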
Please take a look and let us know if you have any questions.
@rxtan2 Thanks for the quick response!
I need a text prompt for CLIP filtering, but this part still seems to be missing. I'm curious how you feed in the prompt; could you provide some sample data?
I would also like to know how the start and end of `vid_ann` are annotated in `run_detect_segments.py`.
> I need a text prompt for CLIP filtering, but this part still seems to be missing.
Do you mean an example text annotation from one of the used datasets?
> annotate the start and end of `vid_ann`
These start and end annotations, typically given as start and end times in seconds within the corresponding videos, are included with the original annotations of datasets like Epic-Kitchens and Ego4D. Do you mean an example from these datasets?
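In case it helps other readers: the exact schema differs across datasets, but a second-based annotation record could look roughly like the sketch below. All field names here are hypothetical and invented for illustration, not the actual Epic-Kitchens or Ego4D schema:

```python
# Hypothetical annotation record; real field names differ across
# datasets (Epic-Kitchens and Ego4D each use their own schema).
vid_ann = {
    "video_id": "P01_01",
    "start_sec": 12.4,
    "end_sec": 18.9,
    "narration": "cut the tomato",
}


def to_frame_range(ann, fps):
    """Convert second-based start/end annotations to frame indices."""
    return round(ann["start_sec"] * fps), round(ann["end_sec"] * fps)


print(to_frame_range(vid_ann, 30))  # → (372, 567)
```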