
An official implementation for "CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval"

28 CLIP4Clip issues

Hello, thanks for providing the code! Regarding the models used to generate the results in the paper, are these uploaded anywhere they can be shared? I would be...

How long have you been training?

In your paper, you conducted experiments for 5 epochs. In reference to this issue (https://github.com/ArrowLuo/CLIP4Clip/issues/36), it is mentioned that you reported performance based on the best scores on the validation...

Can you provide the parameters after post-pretraining on HowTo100M, or the parameters trained for the downstream tasks? Due to limited computing resources, I would like to obtain...

I edited two main things: 1. Deleted the "loss.mean()" call, which does nothing under DDP; DDP already synchronizes gradients automatically. 2. Following this comment, https://github.com/openai/CLIP/issues/132#issuecomment-908004353, we do every similarity calculation locally....

https://github.com/ArrowLuo/CLIP4Clip/blob/508ffa3de39ba0563a03199c440ab602a72e9b6f/modules/modeling.py#L400
```python
if self.training:
    visual_output = allgather(visual_output, self.task_config)
    video_mask = allgather(video_mask, self.task_config)
    sequence_output = allgather(sequence_output, self.task_config)
    torch.distributed.barrier()

visual_output = visual_output / visual_output.norm(dim=-1, keepdim=True)
visual_output = self._mean_pooling_for_similarity_visual(visual_output, video_mask)
visual_output = visual_output...
```
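The flow in that snippet (gather features across processes under DDP, L2-normalize the per-frame features, mean-pool them over valid frames into one video vector, then take dot-product similarities against text vectors) can be sketched in a single-process NumPy toy. All shapes, names, and values here are illustrative assumptions, not the repo's actual API:

```python
import numpy as np

def l2norm(x, axis=-1):
    # scale each vector along `axis` to unit L2 norm
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def masked_mean_pool(frames, mask):
    # frames: (B, T, D) per-frame features; mask: (B, T) with 1 = valid frame
    m = mask[..., None]
    return (frames * m).sum(axis=1) / np.maximum(m.sum(axis=1), 1e-8)

rng = np.random.default_rng(0)
text = l2norm(rng.standard_normal((4, 512)))      # (B, D) text embeddings
frames = rng.standard_normal((4, 12, 512))        # (B, T, D) frame embeddings
mask = np.ones((4, 12))
mask[:, 8:] = 0                                   # last 4 frames are padding

# normalize frames, pool over valid frames, normalize the pooled video vector
video = l2norm(masked_mean_pool(l2norm(frames), mask))

sim = text @ video.T                              # (4, 4) text-video similarity matrix
```

In the actual DDP training code, `allgather` would first concatenate `frames`, `mask`, and `text` across all processes so the similarity matrix covers the global batch, which is the point of the linked CLIP comment.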

Hi, I'm a beginner and would like to ask a question. What do Pair, L, and T stand for in the code? What do they mean?
```
# Pair x L...
```

In main_task_retrieval.py, in the function "train_epoch", we can see:
```python
if n_gpu > 1:
    loss = loss.mean()  # mean() to average on multi-gpu.
```
But in modeling.py, there is:
```python
...
```
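The apparent contradiction usually comes down to `nn.DataParallel` versus `DistributedDataParallel`: DataParallel gathers one scalar loss per GPU into a vector on the main device, so the training loop must reduce it with `.mean()`; under DDP each process computes a single scalar loss and gradient averaging happens inside `backward()`, so `.mean()` on an already-scalar loss is a no-op. A toy NumPy illustration (the loss values are hypothetical):

```python
import numpy as np

# nn.DataParallel case: the wrapper returns one scalar loss per GPU,
# stacked into a vector, so the loop reduces it explicitly.
per_replica = np.array([0.9, 1.1, 1.0, 1.2])  # hypothetical per-GPU losses
dp_loss = per_replica.mean()                  # single scalar for backward()

# DistributedDataParallel case: each process already holds one scalar loss;
# .mean() on a scalar returns the same value, i.e. it does nothing.
ddp_loss = np.float64(1.05)
ddp_loss_after = ddp_loss.mean()
```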

Sorry to disturb you. When I reproduce the results on the LSMDC dataset, I get worse results than those in the paper. In the meanP experiment, the meanR is always around 200,...

Hi authors! Thanks for the great work! I saw that the paper is evaluated on all kinds of video-to-text datasets. The CLIP model itself works pretty well for image-to-image retrieval, despite...