ViS4mer

lvu_durations.csv

Open nbgundavarapu opened this issue 2 years ago • 8 comments

Hi authors,

How are the durations in lvu_durations.csv computed? The last 20s of most videos show previews of other videos. Does lvu_durations.csv give the number of seconds in each video excluding the preview?

Thanks

nbgundavarapu, Oct 19 '22 23:10

These lines of code https://github.com/md-mohaiminul/ViS4mer/blob/2a2442bb0fc84b6823150c73b00e713179c59b7c/extract_features/extract_features_lvu_vit.py#L65-L68 suggest that these previews are used in training and evaluation. Could you confirm? Thanks!

nbgundavarapu, Oct 26 '22 19:10

Hi, thanks for reaching out. We used the durations from the Condensed Movies dataset. Its authors removed the outro/preview from each video, as described in Section 3.1 of their paper. Therefore, lvu_durations.csv does not include the outro/preview of each video.

md-mohaiminul, Oct 28 '22 14:10

Thanks for your reply! Do the downloaded mp4 videos have the outro/preview removed?

If not, the following code seems to include the outro/preview, and the resulting features are then used in training/evals. https://github.com/md-mohaiminul/ViS4mer/blob/2a2442bb0fc84b6823150c73b00e713179c59b7c/extract_features/extract_features_lvu_vit.py#L58-L68

For example, consider the video 9NG5mJgw6Yg in the writer set, with duration = 154s and actual video length = 184s. The code above will include frames after 154s, which contain the outro/preview.

nbgundavarapu, Nov 01 '22 01:11

In the example above, could you walk through that code from your codebase at i = 153? idx = int(184/154*153) = int(182.8) = 182. Hence, features[153] = model_fwd(video[182]).

In effect, features[153] is computed from frame 182, which lies in the outro (past 154s). So during LVU evals an outro frame will be used for this video, which is not what you intended. This looks like a bug. The same is true for many videos and frames.
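The walk-through can be checked with a minimal sketch. It assumes the linked lines map feature slot i to frame int(num_frames / duration * i), as the arithmetic above suggests, and that the video is decoded at one frame per second. The function name `sampled_indices` is hypothetical and not taken from the repository.

```python
def sampled_indices(num_frames: int, duration: int) -> list[int]:
    """Map each of `duration` feature slots to a source frame index.

    Assumed reading of the linked code: the slots are stretched over the
    whole decoded video, so the outro is included when num_frames > duration.
    """
    return [int(num_frames / duration * i) for i in range(duration)]


# Example from this thread: 184 decoded frames, listed duration of 154s.
idx = sampled_indices(num_frames=184, duration=154)

# Feature slots whose source frame lies at or past the listed duration.
outro_slots = [slot for slot, frame in enumerate(idx) if frame >= 154]

print("frame used for slot 153:", idx[153])
print("first slot drawing from the outro:", outro_slots[0])
print("number of slots drawing from the outro:", len(outro_slots))
```

Under these assumptions, the last slots of the feature sequence are all computed from outro frames, not only slot 153.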

nbgundavarapu, Nov 14 '22 01:11

Hi, I think you are right. You need to remove the outro first, which is what we also did. You can use the durations from lvu_durations.csv to do this.
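A minimal sketch of that trimming step follows. It assumes the per-video duration (in seconds) has already been looked up from lvu_durations.csv and that frames are decoded at a known frame rate. The helper names `trim_outro` and `sample_frames` are hypothetical and do not come from the repository.

```python
def trim_outro(frames, duration_s, fps=1.0):
    """Keep only the frames inside the listed duration (outro dropped).

    `duration_s` is assumed to be the value listed for this video in
    lvu_durations.csv; `fps` is the assumed decoding frame rate.
    """
    keep = int(duration_s * fps)
    return frames[:keep]


def sample_frames(frames, num_slots):
    """Pick `num_slots` frames spread evenly over `frames`."""
    n = len(frames)
    return [frames[int(n / num_slots * i)] for i in range(num_slots)]


# Example from this thread: 184 decoded frames, listed duration of 154s.
video = list(range(184))          # stand-in for decoded frames
trimmed = trim_outro(video, 154)  # drop the outro before sampling
sampled = sample_frames(trimmed, 154)

print("frames kept after trimming:", len(trimmed))
print("last frame sampled:", sampled[-1])
```

Trimming before sampling means every sampled index stays below the listed duration, so no outro frame can reach the feature extractor.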

md-mohaiminul, Nov 14 '22 04:11

Thanks! In light of the above bug, could you please check and confirm whether the results reported in the paper were computed with the outro included? The current state of the codebase definitely uses the outro.

Context: I'm struggling to reproduce the results from the paper. There is a 1% difference in performance depending on whether I include or exclude the outro, and including the outro brings my results close to those reported in the paper.

nbgundavarapu, Nov 15 '22 04:11

Which task did you try, and what performance are you getting? Also, how did you solve the 'NaN' issue? Could you please reply on the other issue so that others can benefit from it?

md-mohaiminul, Nov 15 '22 04:11

I've not been able to solve the NaN issue. I'm working on a reimplementation in JAX, building upon annotated-s4.

I've tried all the classification tasks. There is a ~1% gap on relationship, director, writer, and speaking between including and excluding the outro.

nbgundavarapu, Nov 15 '22 19:11