xmodaler
How to extract a global video feature based on BUTD?
I noticed that BUTD outputs one 'npz' file per image. Generating video captions with xmodaler, however, requires a global video feature.
How can I obtain the final video feature from the BUTD outputs of multiple frames?
On the MSRVTT dataset, I tried using the top-N objects voted across multiple frames sampled uniformly from each video, but the resulting captions are of poor quality: BLEU on the evaluation and test sets only reaches 0.6, and many captions contain <UNK>.
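
To make the aggregation step concrete, here is a minimal sketch of this kind of pipeline. It is not xmodaler or BUTD code. The helper names (`uniform_frame_indices`, `top_n_regions`, `global_feature`) are hypothetical, and it assumes each frame's region features and detection scores are already loaded as NumPy arrays (the key names inside the 'npz' files are not shown here because they depend on the extractor). Ranking pooled regions by score stands in for the voting step, and mean pooling is only one possible way to form a global feature.

```python
import numpy as np

def uniform_frame_indices(num_frames, num_samples):
    """Pick num_samples frame indices spread evenly over the video."""
    return np.linspace(0, num_frames - 1, num=num_samples).round().astype(int)

def top_n_regions(frame_feats, frame_scores, n):
    """Pool regions from all sampled frames and keep the n highest-scoring ones.

    frame_feats:  list of (regions_i, dim) arrays, one per frame (assumed layout)
    frame_scores: list of (regions_i,) arrays with one score per region
    """
    feats = np.concatenate(frame_feats, axis=0)    # (total_regions, dim)
    scores = np.concatenate(frame_scores, axis=0)  # (total_regions,)
    keep = np.argsort(-scores)[:n]                 # indices of the n best scores
    return feats[keep]

def global_feature(region_feats):
    """Mean-pool region features into a single vector (one possible choice)."""
    return region_feats.mean(axis=0)

# Toy example: two frames with 3 and 2 regions, 4-dimensional features.
toy_feats = [np.ones((3, 4)), 2 * np.ones((2, 4))]
toy_scores = [np.array([0.1, 0.9, 0.2]), np.array([0.8, 0.3])]
selected = top_n_regions(toy_feats, toy_scores, n=2)
video_vec = global_feature(selected)
```

With this layout the caption model receives either the `(n, dim)` region matrix or the pooled `(dim,)` vector, so the shape it expects has to be checked against the dataset config being used.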