merlot
Issue with model scalability due to segment-level positional embeddings
I notice that MERLOT adopts segment-level positional embeddings, but only 16 segments are used during pre-training. For longer videos, e.g., movies, 16 segments are not enough to encode their content. Specifically, I have two questions:
- How can features be extracted for extremely long videos, such as movies?
- Would fixed positional embeddings work instead of learned ones?
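To illustrate the second question: unlike a learned lookup table with only 16 entries, fixed sinusoidal embeddings (as in the original Transformer paper) are computed from a closed-form formula, so they can be evaluated at any segment index and extrapolate beyond the pre-training length. A minimal NumPy sketch of this idea (function name and dimensions are illustrative, not from MERLOT's codebase):

```python
import numpy as np

def sinusoidal_segment_embeddings(num_segments: int, dim: int) -> np.ndarray:
    """Fixed (non-learned) sinusoidal embeddings, one row per segment index.

    Because the values come from a formula rather than a trained table,
    the same function covers 16 pre-training segments or hundreds of
    segments from a full-length movie.
    """
    positions = np.arange(num_segments)[:, None]                    # (num_segments, 1)
    freqs = np.exp(np.arange(0, dim, 2) * (-np.log(10000.0) / dim)) # (dim // 2,)
    emb = np.zeros((num_segments, dim))
    emb[:, 0::2] = np.sin(positions * freqs)
    emb[:, 1::2] = np.cos(positions * freqs)
    return emb

# The first 16 rows are identical regardless of total length,
# so extending to longer videos does not disturb shorter ones.
short = sinusoidal_segment_embeddings(16, 768)
long = sinusoidal_segment_embeddings(256, 768)
```

Whether such fixed embeddings match the downstream quality of learned ones at training length is a separate empirical question, but they at least remove the hard 16-segment ceiling.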