questions about MSVD

Open xixiareone opened this issue 4 years ago • 12 comments

The article mentions that "where they randomly chose 5 ground-truth sentences per video. We use the same setting when we compare with that approach". Do the training, validation and test sets each take 5 sentences at random? In other words, are not all sentences used in the training, validation and test sets?

Thank you very much!

xixiareone avatar May 25 '20 17:05 xixiareone

Hi @xixiareone, do you have a pointer to that sentence (e.g. in which section in the article it appears)? Thanks! For reference, in our setting, we train by randomly sampling from all possible training captions, then we test each sentence query independently.
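
For concreteness, here is a minimal sketch of what I mean (the helper names and data structure are hypothetical, not the repo's actual data loader):

    import random

    def sample_training_pairs(captions_by_video, rng=random):
        # Training: draw one caption uniformly at random from each video's full caption pool.
        return [(vid, rng.choice(caps)) for vid, caps in captions_by_video.items()]

    def test_queries(captions_by_video):
        # Testing: every caption becomes its own independent text->video query.
        return [(vid, cap) for vid, caps in captions_by_video.items() for cap in caps]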

albanie avatar May 25 '20 17:05 albanie

Sorry, I didn't explain the source. That sentence is from another article I was reading.

So I would like to ask: for the MSVD dataset, especially in the test phase, do you evaluate all sentences, or do you randomly select just 5 sentences per video for evaluation?

xixiareone avatar May 25 '20 17:05 xixiareone

No worries! In the test phase, all sentences are used (independently). The evaluation we use was based on the protocol used here: https://github.com/niluthpol/multimodal_vtt

albanie avatar May 25 '20 17:05 albanie

Eh, I'm a little confused.

For example, this line in https://github.com/niluthpol/multimodal_vtt:

npts = videos.shape[0] / 20. Here, 20 means each video corresponds to 20 descriptions, which comes from the MSR-VTT dataset. But in MSVD the number of descriptions per video is different. How do you deal with each of them independently?

This problem has bothered me for a long time. Thank you very much!

xixiareone avatar May 25 '20 18:05 xixiareone

I agree it's confusing! I've summarised below my understanding of the evaluation protocols for MSVD.

Design choices

There are two choices to be made for datasets (like MSVD) that contain a variable number of sentences per video (a small illustrative sketch follows the list):

  1. (Assignment) Which sentences should be assigned to each video (i.e. whether they should be subsampled to a fixed number per video, or whether all available sentences should be used?)
  2. (Evaluation) How should the system be evaluated for retrieval performance when multiple sentences are assigned to each video (for example, should multiple sentences be used together to retrieve the target video, or should they be used independently of the knowledge of other description assignments for the same video?)
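
To make these two choices concrete, here is a small hypothetical sketch (the function names and data structures are illustrative only, not taken from any of the papers below):

    import random

    def assign_captions(captions_by_video, subsample_to=None, seed=0):
        # Assignment: either keep every caption, or subsample a fixed number per video.
        if subsample_to is None:
            return captions_by_video
        rng = random.Random(seed)
        return {vid: rng.sample(caps, min(subsample_to, len(caps)))
                for vid, caps in captions_by_video.items()}

    def text_to_video_queries(assigned, independent=True):
        # Evaluation: issue each caption as its own query, or pool a video's captions jointly.
        if independent:
            return [(vid, cap) for vid, caps in assigned.items() for cap in caps]
        return [(vid, caps) for vid, caps in assigned.items()]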

Previous works using MSVD

When reporting numbers in our paper, I looked at the following papers using MSVD for retrieval, to try to understand the different protocols.

  1. Learning Joint Representations of Videos and Sentences with Web Image Search
  2. Predicting Visual Features from Text for Image and Video Caption Retrieval
  3. Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval
  4. Dual Encoding for Zero-Example Video Retrieval

I've added my notes below. Direct quotes from each paper are written in quotation marks.


Learning Joint Representations of Videos and Sentences with Web Image Search

Assignment: "We first divided the dataset into 1,200, 100, and 670 videos for training, validation, and test, respectively, as in [35,34,11]. Then, we extracted five-second clips from each original video in a sliding-window manner. As a result, we obtained 8,001, 628, and 4,499 clips for the training, validation, and test sets, respectively. For each clip, we picked five ground truth descriptions out of those associated with its original video."

Evaluation: (Retrieving videos with text) "Given a video and a query sentence, we extracted five-second video clips from the video and computed Euclidean distances from the query to the clips. We used their median as the distance of the original video and the query. We ranked the videos based on the distance to each query and recorded the rank of the ground truth video." (Retrieving text with videos) "We computed the distances between a sentence and a query video in the same way as the video retrieval task. Note that each video has five ground truth sentences; thus, we recorded the highest rank among them. The test set has 3,500 sentences."

Summary: Taken together, my interpretation is that the authors first randomly assign 5 sentences per video before any experiments are done. They then break the videos into clips and further randomly assign five sentences to each clip (i.e. sampling with replacement from the initial pool of 5 sentences that were assigned to the video). Since the test set has 670 videos and 670 * 5 = 3350, this approximately lines up with the comment that the test set has 3,500 sentences. In terms of evaluation, when retrieving videos with text, each query is performed and evaluated independently.
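
A rough sketch of my reading of this protocol (illustrative code with made-up function names, not the authors' implementation):

    import numpy as np

    def video_distance(sentence_emb, clip_embs):
        # Distance between a sentence and a video = median Euclidean distance to its clips.
        return np.median(np.linalg.norm(clip_embs - sentence_emb[None, :], axis=1))

    def text_to_video_rank(sentence_emb, clips_per_video, gt_video_idx):
        # Text->video: rank all videos for one sentence query, return the GT video's rank.
        dists = np.array([video_distance(sentence_emb, clips) for clips in clips_per_video])
        order = np.argsort(dists)
        return int(np.where(order == gt_video_idx)[0][0]) + 1

    def video_to_text_rank(clip_embs, sentence_embs, gt_sentence_idxs):
        # Video->text: rank all sentences, keep the best rank among the 5 GT sentences.
        dists = np.array([video_distance(s, clip_embs) for s in sentence_embs])
        order = np.argsort(dists)
        return min(int(np.where(order == idx)[0][0]) + 1 for idx in gt_sentence_idxs)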


Predicting Visual Features from Text for Image and Video Caption Retrieval

Assignment: "For the ease of cross-paper comparison, we follow the identical data partitions as used in [5], [7], [58] for images and [60] for videos. That is, training / validation / test is 6k / 1k / 1k for Flickr8k, 29K / 1,014 / 1x k for Flickr30k, and 1,200 / 100 / 670 for MSVD." (The reference [60] here refers to Translating videos to natural language using deep recurrent neural network).

Evaluation: "The training, validation and test set are used for model training, model selection and performance evaluation, respectively, and exclusively. For performance evaluation, each test caption is first vectorized by a trained Word2VisualVec. Given a test image/video query, we then rank all the test captions in terms of their similarities with the image/video query in the visual feature space. The performance is evaluated based on the caption ranking."

Summary: The paper doesn't mention that they perform subsampling to five captions per video, so it's probably safe to assume that they don't. The evaluation code they have made available for image/text retrieval does assume a fixed number of captions per image (with a default value of 5), but (as of 26/05/20) the MSVD code is not available (I guess this is what you were asking about here) and the comment by the author in that issue implies that all sentences (rather than just 5) are used.
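
In other words, each test video is used as a query against the full pool of test captions. A compact hypothetical sketch of that ranking step (assuming embeddings are already in a shared space; not the authors' code):

    import numpy as np

    def caption_ranks_for_video(video_feat, caption_feats, gt_caption_idxs):
        # Rank every test caption (no subsampling to 5) against a single video query.
        sims = caption_feats @ video_feat
        order = np.argsort(-sims)
        return [int(np.where(order == idx)[0][0]) + 1 for idx in gt_caption_idxs]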


Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval

Assignment: "For a fair comparison, we used the same splits utilized in prior works [32], with 1200 videos for training, 100 videos for validation, and 670 videos for testing. The MSVD dataset is also used in [24] for video-text retrieval task, where they randomly chose 5 ground-truth sentences per video. We use the same setting when we compare with that approach." (Note the references here are: [24] Learning Joint Representations of Videos and Sentences with Web Image Search, [32] Sequence to Sequence – Video to Text.)

Evaluation: The code given here implements the evaluation protocol as described in Learning Joint Representations of Videos and Sentences with Web Image Search).

Summary: Retrieval performance is reported on two splits in Table 2 and Table 3 of the paper (one described as LJRV, the other as JMET and JMDV). My interpretation was that the LJRV split refers to the practice of sampling 5 descriptions per video, and that the JMET & JMDV split refers to using all captions from a video (since this is what is used in Predicting Visual Features from Text for Image and Video Caption Retrieval, and this number is reported in the table).


Dual Encoding for Zero-Example Video Retrieval

Assignment: I wasn't quite able to determine the splits from the paper, but the comment here suggests that the test set remains the same as each of the other papers above (670 videos).

Evaluation: This is performed in a zero-shot setting.

Summary: This work is by the same author as Predicting Visual Features from Text for Image and Video Caption Retrieval, so in the absence of extra comments in the paper, it's probably reasonable to assume that the same protocol is used in both works (using all descriptions per video).


Use What You Have: Video Retrieval Using Representations From Collaborative Experts

Assignment: As with all four papers above, this repo uses the 1200, 100, 670 split between train, val and test. It uses all captions associated with each video (sampling one caption per video randomly during training).

Evaluation: (Retrieving videos with text) Each query is evaluated independently of the others, and all test set queries are used (i.e. more than 5 per video). (Retrieving text with videos) As with the other papers above, if a video has multiple descriptions, we evaluate each independently, then take the minimum rank (this is what I was referring to in my comment above when I said that we used the same protocol as Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval).
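
As a toy illustration of this convention with a caption-by-video similarity matrix (hypothetical helper names; the real implementation lives in model/metric.py):

    import numpy as np

    def t2v_rank(sims, caption_idx, gt_video_idx):
        # Text->video: each caption is an independent query over all test videos.
        order = np.argsort(-sims[caption_idx])
        return int(np.where(order == gt_video_idx)[0][0]) + 1

    def v2t_best_rank(sims, caption_idxs_for_video, video_idx):
        # Video->text: rank every caption for one video, keep the minimum (best) rank
        # over that video's ground-truth captions.
        order = np.argsort(-sims[:, video_idx])
        return min(int(np.where(order == c)[0][0]) + 1 for c in caption_idxs_for_video)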


Summary

My interpretation is that:

  1. Learning Joint Representations of Videos and Sentences with Web Image Search and the LJRV split reported by Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval use the same protocol: randomly sampling 5 descriptions per video before any experiments are run, then using these same descriptions throughout.
  2. Predicting Visual Features from Text for Image and Video Caption Retrieval, Dual Encoding for Zero-Example Video Retrieval, the JMET and JMDV split reported by Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval and Use What You Have: Video Retrieval Using Representations From Collaborative Experts use all descriptions for each video.
  3. For all protocols used above, when retrieving text with videos the evaluation is "optimistic" in the sense that the highest rank among the possible retrieved candidates is the one that is recorded.

If the specific five captions that were sampled for each video (used for the LJRV split) are still available, I will implement this as an additional protocol in our codebase to ensure the comparisons can also be made under this setting (I will follow up with the authors by email to find out).

Either way, an important takeaway is that the protocols are different. Thanks a lot for drawing attention to this issue!

albanie avatar May 26 '20 07:05 albanie

Thank you very much for your reply, and I admire your thoroughness. I have two small questions about your code for the MSVD dataset:

  1. The first question: in https://github.com/albanie/collaborative-experts/blob/master/configs/data_loader_msvd.json there is the entry "num_test_captions": 81. I would like to ask why this value is 81?
  2. The second question: in https://github.com/albanie/collaborative-experts/blob/master/model/metric.py,

in def t2v_metrics(sims, query_masks=None):

def t2v_metrics(sims, query_masks=None):
    """Compute retrieval metrics from a similarity matrix.

    Args:
        sims (th.Tensor): N x M matrix of similarities between embeddings, where
             x_{i,j} = <text_embd[i], vid_embed[j]>
        query_masks (th.Tensor): mask any missing queries from the dataset (two videos
             in MSRVTT only have 19, rather than 20 captions)

    Returns:
        (dict[str:float]): retrieval metrics
    """
    assert sims.ndim == 2, "expected a matrix"
    num_queries, num_vids = sims.shape
    dists = -sims
    sorted_dists = np.sort(dists, axis=1)

    if False:
        import sys
        import matplotlib
        from pathlib import Path
        matplotlib.use("Agg")
        import matplotlib.pyplot as plt
        sys.path.insert(0, str(Path.home() / "coding/src/zsvision/python"))
        from zsvision.zs_iterm import zs_dispFig  # NOQA
        plt.matshow(dists)
        zs_dispFig()
        import ipdb; ipdb.set_trace()

    # The indices are computed such that they slice out the ground truth distances
    # from the pseudo-rectangular dist matrix
    queries_per_video = num_queries // num_vids
    gt_idx = [[np.ravel_multi_index([ii, jj], (num_queries, num_vids))
               for ii in range(jj * queries_per_video, (jj + 1) * queries_per_video)]
              for jj in range(num_vids)]
    gt_idx = np.array(gt_idx)
    gt_dists = dists.reshape(-1)[gt_idx.reshape(-1)]
    gt_dists = gt_dists[:, np.newaxis]
    rows, cols = np.where((sorted_dists - gt_dists) == 0)  # find column position of GT

In this code: queries_per_video = num_queries // num_vids. According to what you said, queries_per_video...


xixiareone avatar May 26 '20 14:05 xixiareone

Hi @xixiareone,

81 is the maximum number of sentences for a single video used in MSVD. For efficiency, we compute a similarity matrix with a fixed number of sentences per video (we use 81 since this corresponds to the maximum number of sentences assigned to any individual video). We then mask out all the positions that correspond to sentences that are not present (for videos that have fewer than 81 captions) so that they do not affect the evaluation.
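
A toy illustration of that padding and masking idea (the numbers come from this thread; this is not the exact code in model/metric.py):

    import numpy as np

    num_vids, max_caps = 670, 81                            # MSVD test set: 670 videos, at most 81 captions each
    sims = np.random.randn(num_vids * max_caps, num_vids)   # placeholder similarity matrix
    query_masks = np.zeros((num_vids, max_caps), dtype=bool)
    query_masks[0, :40] = True                              # e.g. if video 0 had 40 real captions

    valid_rows = query_masks.reshape(-1)                    # aligns with the rows of `sims`
    valid_sims = sims[valid_rows]                           # padded (non-existent) queries are dropped from the metrics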

I'm sorry, I didn't quite understand the second part of your question?

albanie avatar May 26 '20 14:05 albanie

Sorry, I got it wrong. I meant: is queries_per_video equal to 81?


xixiareone avatar May 26 '20 14:05 xixiareone

Yes, for MSVD, that's correct.

albanie avatar May 26 '20 15:05 albanie

Thank you very much for your answer, and I wish you all the best in your research! I am very happy because the puzzle is solved!


xixiareone avatar May 26 '20 16:05 xixiareone

I am so sorry to bother you again. I tried to download trained_model.pth using: export MODEL=data/models/msvd-train-full-ce/296d24f9/seed-0/2020-01-26_02-02-33/trained_model.pth

but the download fails with a 404 error. In addition, I also tried the other MSVD models, but they cannot be used for MSVD testing.

Many thanks!


xixiareone avatar May 28 '20 14:05 xixiareone

No problem! Do you have a reference for where this model checkpoint comes from? The example given on the main README is:

# fetch the pretrained experts for MSVD 
python3 misc/sync_experts.py --dataset MSVD

# find the name of a pretrained model using the links in the tables above 
export MODEL=data/models/msvd-train-full-ce/5bb8dda1/seed-0/2020-01-30_12-29-56/trained_model.pth

# create a local directory and download the model into it 
mkdir -p $(dirname "${MODEL}")
wget --output-document="${MODEL}" "http://www.robots.ox.ac.uk/~vgg/research/collaborative-experts/${MODEL}"

# Evaluate the model
python3 test.py --config configs/msvd/train-full-ce.json --resume ${MODEL} --device 0 --eval_from_training_config

Do these steps fail for you?

albanie avatar May 31 '20 18:05 albanie