fairseq Inferior performance of VideoClip on Video-text retrieval task using COIN dataset.

Open DuL1nk opened this issue 3 years ago • 2 comments

We tested VideoClip on a video-text retrieval task using the COIN dataset, but the accuracy is much lower than the VideoQA result reported in the paper (26% << 74%), even though VideoQA can also be formulated as a video-text retrieval task.

We follow the inference demo and, for every video clip in the COIN dataset, search for the most similar label in the task-level candidate label pool. The accuracy is about 26% (<< 74% reported on MSR-VTT). Comparing the domain shift from HowTo100M to MSR-VTT with the one from HowTo100M to COIN, we expected VideoClip to perform better on COIN. Is there any possible reason for the inferior performance on COIN, or anything in the code that is worth noticing? Thanks a lot!
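
For reference, a minimal sketch of the evaluation described above, assuming the video and text embeddings have already been extracted. This is not the fairseq/VideoClip code: `top1_accuracy`, the tuple layout, the task and label names, and the toy embeddings are all hypothetical and only illustrate the top-1 retrieval step over a task-level label pool.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def top1_accuracy(
    clips: List[Tuple[Sequence[float], str, str]],
    label_pools: Dict[str, Dict[str, Sequence[float]]],
) -> float:
    """Fraction of clips whose most similar in-task label is the true label.

    clips: (video_embedding, task_id, true_label) triples.
    label_pools: task_id -> {label_text: text_embedding}.
    """
    correct = 0
    for video_emb, task_id, true_label in clips:
        pool = label_pools[task_id]  # candidates are restricted to the clip's task
        predicted = max(pool, key=lambda label: cosine(video_emb, pool[label]))
        correct += predicted == true_label
    return correct / len(clips) if clips else 0.0


# Toy data (hypothetical embeddings, not real model output).
pools = {
    "make_coffee": {"pour water": [1.0, 0.0], "grind beans": [0.0, 1.0]},
}
toy_clips = [
    ([0.9, 0.1], "make_coffee", "pour water"),
    ([0.2, 0.8], "make_coffee", "grind beans"),
    ([0.7, 0.3], "make_coffee", "grind beans"),  # deliberately misranked clip
]
accuracy = top1_accuracy(toy_clips, pools)
print(f"top-1 accuracy: {accuracy:.3f}")
```

Note that the chance level for a pool of k labels is 1/k, so accuracies measured over candidate pools of different sizes are not directly comparable.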

May 10 '22 13:05 DuL1nk

Hi, could you please share your package versions, your pip version, and any other relevant environment details? I can't get the example to run on my computer.

Sep 05 '24 22:09 qingy1337

@qingy1337 Perhaps the following might work for you? I got SignCLIP running, which is based on VideoCLIP and should have the same requirements, using:

this script, which uses conda to set up a Python 3.8.8 env
this requirements.txt, which lists some specific versions and is used by the setup script above

Dec 16 '24 18:12 cleong110
