How to extract a global video feature based on BUTD?

Open HanielF opened this issue 3 years ago • 0 comments

I noticed that BUTD outputs an 'npz' file for each single image. However, when I want to generate video captions with xmodaler, it requires a global video feature.

How can I extract the final video feature from the BUTD outputs of multiple frames?

On the MSRVTT dataset, I tried using the top-N objects voted across multiple frames that were sampled uniformly from each video. However, the resulting captions are of poor quality: the BLEU score on the evaluation and test sets only reaches up to 0.6, and many captions contain <UNK>.
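
For reference, here is a minimal sketch of roughly what the voting step described above could look like. It is an illustration under assumptions, not the exact code that was run: it assumes the per-frame region features, object class labels, and confidences have already been read out of the BUTD 'npz' files into NumPy arrays, and the function names `uniform_frame_indices` and `vote_top_n_regions` are hypothetical. Ranking classes by summed confidence and averaging their features with confidence weights is one possible reading of "voted by multiple frames".

```python
import numpy as np


def uniform_frame_indices(num_frames, num_samples):
    """Pick num_samples frame indices spread evenly over the video."""
    return np.linspace(0, num_frames - 1, num=num_samples).round().astype(int)


def vote_top_n_regions(frame_feats, frame_labels, frame_scores, top_n):
    """Aggregate per-frame region features into at most top_n video-level features.

    frame_feats:  list of (num_regions_i, dim) arrays, one per sampled frame
    frame_labels: list of (num_regions_i,) integer object-class arrays
    frame_scores: list of (num_regions_i,) positive confidence arrays
    (All three are assumed to be loaded beforehand from the per-frame files.)
    """
    feats = np.concatenate(frame_feats)
    labels = np.concatenate(frame_labels)
    scores = np.concatenate(frame_scores)

    # Each detected region votes for its class with its confidence.
    classes = np.unique(labels)
    votes = np.array([scores[labels == c].sum() for c in classes])
    winners = classes[np.argsort(-votes, kind="stable")[:top_n]]

    # Represent each winning class by the confidence-weighted mean of its regions.
    pooled = []
    for c in winners:
        mask = labels == c
        weights = scores[mask] / scores[mask].sum()
        pooled.append((weights[:, None] * feats[mask]).sum(axis=0))
    return np.stack(pooled), winners


# Toy example: two frames, 2-dimensional features, object classes 5 and 7.
toy_feats = [np.array([[1.0, 0.0], [0.0, 1.0]]), np.array([[3.0, 0.0]])]
toy_labels = [np.array([5, 7]), np.array([5])]
toy_scores = [np.array([0.5, 0.9]), np.array([0.5])]
video_feat, kept_classes = vote_top_n_regions(toy_feats, toy_labels, toy_scores, top_n=1)
```

In this toy example, class 5 collects a total vote of 1.0 against 0.9 for class 7, so with `top_n=1` the video feature is the weighted mean of the two class-5 regions.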

Mar 14 '22 13:03 HanielF