merlot
Issue with model scalability due to segment-level positional embeddings
I notice that MERLOT adopts segment-level positional embeddings, but only 16 segments are used during pre-training. For longer videos, e.g., movies, 16 segments are not enough to encode their content. Specifically, I have two questions:
- How can features be extracted for extremely long videos, such as movies?
- Would fixed positional embeddings work instead of learned ones?
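To illustrate the second question: unlike a learned lookup table with only 16 entries, fixed sinusoidal embeddings (as in the original Transformer paper) are computed from a closed-form formula, so they can be evaluated at any segment index and extrapolate beyond the pre-training length. A minimal NumPy sketch of this idea (function name and dimensions are illustrative, not from MERLOT's codebase):

```python
import numpy as np

def sinusoidal_segment_embeddings(num_segments: int, dim: int) -> np.ndarray:
    """Fixed (non-learned) sinusoidal embeddings, one row per segment index.

    Because the values come from a formula rather than a trained table,
    the same function covers 16 pre-training segments or hundreds of
    segments from a full-length movie.
    """
    positions = np.arange(num_segments)[:, None]                    # (num_segments, 1)
    freqs = np.exp(np.arange(0, dim, 2) * (-np.log(10000.0) / dim)) # (dim // 2,)
    emb = np.zeros((num_segments, dim))
    emb[:, 0::2] = np.sin(positions * freqs)
    emb[:, 1::2] = np.cos(positions * freqs)
    return emb

# The first 16 rows are identical regardless of total length,
# so extending to longer videos does not disturb shorter ones.
short = sinusoidal_segment_embeddings(16, 768)
long = sinusoidal_segment_embeddings(256, 768)
```

Whether such fixed embeddings match the downstream quality of learned ones at training length is a separate empirical question, but they at least remove the hard 16-segment ceiling.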