
Model scalability issue due to segment-level positional embeddings

Open SCZwangxiao opened this issue 3 years ago • 0 comments

I noticed that MERLOT uses learned segment-level positional embeddings, and that only 16 segments are used during pre-training. For longer videos such as movies, 16 segments are not enough to encode the content. Specifically, I have two questions:

  1. How can features be extracted for extremely long videos such as movies?
  2. Would fixed positional embeddings work in place of the learned ones?
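
To make question 2 concrete, here is a minimal sketch of fixed positional embeddings, assuming the standard sinusoidal formulation from the original Transformer. The function name `sinusoidal_segment_embeddings` and the dimension `dim=8` are hypothetical and chosen only for illustration; nothing here is taken from the MERLOT codebase.

```python
import math

def sinusoidal_segment_embeddings(num_segments, dim):
    """Return a num_segments x dim table of fixed sinusoidal embeddings.

    Hypothetical helper for illustration; not part of MERLOT.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    table = []
    for pos in range(num_segments):
        row = [0.0] * dim
        for i in range(0, dim, 2):
            # Frequency decreases geometrically with the channel index.
            freq = 1.0 / (10000.0 ** (i / dim))
            row[i] = math.sin(pos * freq)      # even channels: sine
            row[i + 1] = math.cos(pos * freq)  # odd channels: cosine
        table.append(row)
    return table

# 16 segments as in pre-training, and a longer table for a movie-length input.
short = sinusoidal_segment_embeddings(16, 8)
long = sinusoidal_segment_embeddings(512, 8)
```

Because the table is computed rather than learned, it can be generated for any number of segments, and its first 16 rows are the same in both calls. Whether a model pre-trained on 16 segments would actually generalize to the unseen positions is a separate question that this sketch does not answer.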

SCZwangxiao, Nov 23 '22 02:11