fairseq
Inferior performance of VideoCLIP on the video-text retrieval task using the COIN dataset.
We evaluated VideoCLIP on a video-text retrieval task using the COIN dataset, but the accuracy is far below the VideoQA result reported in the paper (26% vs. 74%), even though VideoQA can itself be formulated as a video-text retrieval task.
We follow the inference demo and, for every video clip in the COIN dataset, retrieve the most similar label from the task-level candidate label pool. The accuracy is about 26% (far below the 74% reported on MSR-VTT). Considering the domain shift from HowTo100M to MSR-VTT and the one from HowTo100M to COIN, we expected VideoCLIP to perform better on COIN. Is there any likely cause for the inferior performance on COIN, or anything else in the code we should pay attention to? Thanks a lot!
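For reference, the evaluation described above can be sketched roughly as follows. This is a minimal illustration, not the VideoCLIP or fairseq API: the function name `top1_retrieval_accuracy` is hypothetical, the embeddings are assumed to have been extracted already (one vector per clip and one per candidate label), and cosine similarity is an assumption about the scoring rule.

```python
import numpy as np

def top1_retrieval_accuracy(video_emb, label_emb, true_idx):
    """Top-1 accuracy of picking the most similar label for each clip.

    video_emb: (num_clips, dim) clip embeddings (assumed precomputed).
    label_emb: (num_labels, dim) embeddings of the candidate label pool.
    true_idx:  (num_clips,) index of the correct label for each clip.
    """
    video_emb = np.asarray(video_emb, dtype=np.float64)
    label_emb = np.asarray(label_emb, dtype=np.float64)

    # L2-normalize so the dot product below is cosine similarity (an assumption).
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = label_emb / np.linalg.norm(label_emb, axis=1, keepdims=True)

    sim = v @ t.T                 # (num_clips, num_labels) similarity matrix
    pred = sim.argmax(axis=1)     # most similar label per clip
    acc = float((pred == np.asarray(true_idx)).mean())
    return acc, pred
```

Computing the accuracy this way, outside the model code, makes it easier to check whether the 26% comes from the embeddings themselves or from how labels and clips are paired during evaluation.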
Hi, could you please share your package versions and pip version and anything related? I can't seem to make the example run on my computer.
@qingy1337 Perhaps the following might work for you? I got SignCLIP running, which is based on VideoCLIP and should have the same requirements, using:
- this script which uses conda to set up a python 3.8.8 env
- this requirements.txt, which has some specific versions listed, used by the setup script above
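To compare environments, something like the following could be run on both machines. It is only a sketch: the helper name `report_versions` is hypothetical, and the package list is an assumption about what is relevant here, so adjust it to whatever the requirements file actually pins. It needs Python 3.8 or newer for `importlib.metadata`.

```python
import platform
from importlib import metadata

def report_versions(packages):
    """Return the Python version and installed versions of the given packages."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Package is not installed in this environment.
            versions[name] = None
    return versions

if __name__ == "__main__":
    # The package names below are an assumed starting point, not a verified list.
    for name, version in report_versions(["pip", "torch", "fairseq", "numpy"]).items():
        print(f"{name}: {version or 'not installed'}")
```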