CLIP4Clip
An official implementation for "CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval"
Hello, thanks for providing the code! Regarding the models used to generate the results in the paper, are these uploaded anywhere that they can be shared? I would be...
How long have you been training?
In your paper, you conducted experiments for 5 epochs. In reference to this issue (https://github.com/ArrowLuo/CLIP4Clip/issues/36), it is mentioned that you reported performance based on the best scores on the validation...
Can you provide the weights after post-pretraining on HowTo100M, or the weights fine-tuned for the downstream tasks? Due to limited computing resources, I would like to obtain...
I edited two main things: 1. Deleted the `loss.mean()` call, which does nothing; DDP already provides automatic gradient synchronization. 2. Following this comment, https://github.com/openai/CLIP/issues/132#issuecomment-908004353, we do every similarity calculation locally....
https://github.com/ArrowLuo/CLIP4Clip/blob/508ffa3de39ba0563a03199c440ab602a72e9b6f/modules/modeling.py#L400
```python
if self.training:
    visual_output = allgather(visual_output, self.task_config)
    video_mask = allgather(video_mask, self.task_config)
    sequence_output = allgather(sequence_output, self.task_config)
    torch.distributed.barrier()

visual_output = visual_output / visual_output.norm(dim=-1, keepdim=True)
visual_output = self._mean_pooling_for_similarity_visual(visual_output, video_mask)
visual_output = visual_output...
```
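For context, here is a minimal sketch of the all-gather pattern used above: an autograd function that gathers features from every process but routes gradients back only to the local shard. This is an assumption-level illustration; the repository's own `allgather` helper also takes a `task_config` argument and may differ in details.

```python
import torch
import torch.distributed as dist

class AllGather(torch.autograd.Function):
    """Gather a tensor from all processes while keeping gradients for the local slice."""

    @staticmethod
    def forward(ctx, tensor):
        ctx.rank = dist.get_rank()
        ctx.batch_size = tensor.shape[0]
        gathered = [torch.empty_like(tensor) for _ in range(dist.get_world_size())]
        dist.all_gather(gathered, tensor)
        # Concatenate along the batch dimension so the similarity matrix can be
        # computed over the full (global) batch on every rank.
        return torch.cat(gathered, dim=0)

    @staticmethod
    def backward(ctx, grad_output):
        # Pass back only the gradient slice corresponding to the rows this
        # process contributed in forward.
        start = ctx.rank * ctx.batch_size
        return grad_output[start:start + ctx.batch_size]

allgather = AllGather.apply  # usage sketch: visual_output = allgather(visual_output)
```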
Hi, I'm a beginner and would like to ask a question: what do Pair, L, and T stand for in the code, and what do they mean?
```python
# Pair x L...
```
In main_task_retrieval.py, function "train_epoch", we can see:
```python
if n_gpu > 1:
    loss = loss.mean()  # mean() to average on multi-gpu.
```
But in modeling.py, there is:
```python...
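As a hypothetical illustration of why that `.mean()` only matters under `nn.DataParallel` (which stacks one loss per GPU replica) and is effectively a no-op under DistributedDataParallel (one scalar loss per process, with gradients averaged during backward):

```python
import torch

# DataParallel-style: the wrapper returns one loss value per GPU replica,
# so .mean() is needed to reduce them to a single scalar before backward().
dp_losses = torch.tensor([0.7, 0.9])  # hypothetical losses from 2 replicas
print(dp_losses.mean())               # tensor(0.8000)

# DDP-style: each process already holds a single scalar loss, and gradient
# averaging happens automatically inside backward(), so .mean() changes nothing.
ddp_loss = torch.tensor(0.8)
print(ddp_loss.mean())                # tensor(0.8000), a no-op on a 0-dim tensor
```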
Sorry to disturb you. When I reproduce the results on the LSMDC dataset, I get worse results than those in the paper. In the meanP experiment, the meanR is always around 200,...
Hi authors! Thanks for the great work! I saw that this paper is evaluated on all kinds of video-to-text datasets. The CLIP model itself works pretty well for image-to-image retrieval, despite...