
Why do we have different/decreasing `skip_values` as we progress in stage 03 training?

abdksyed opened this issue 1 year ago • 3 comments

I wanted to know the idea behind having different max_skip_values. The value starts at 10, increases to 15, and then drops back to 5, 5.

Is there any intuition and reason for doing this?

Also, another question I had: since the number of frames used for training was 8, as mentioned in the paper, how is the model able to do well on long videos where the number of frames could be in the thousands, when it was never trained for such long-form memory usage? Even with a max jump of 15, the largest frame span in a single training video would be 15*8 = 120 frames.

abdksyed avatar Oct 10 '24 07:10 abdksyed

It is for curriculum learning: first go from easy to hard cases, then anneal back to 5, which is closer to what is used during inference.

Note that max jump sets the maximum. We don't always sample at the maximum.

See https://arxiv.org/pdf/2103.07941 and https://davischallenge.org/challenge2020/papers/DAVIS-Semisupervised-Challenge-1st-Team.pdf
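
As a rough illustration (not the actual XMem training code; the iteration boundaries and function names below are made up), the schedule plus sampling could look something like this, where `max_skip` only caps the gap and the actual gaps are drawn uniformly:

```python
import random

# Hypothetical curriculum: (start_iteration, max_skip) pairs mirroring the
# 10 -> 15 -> 5 -> 5 schedule discussed above. The boundaries are illustrative.
MAX_SKIP_SCHEDULE = [(0, 10), (10_000, 15), (20_000, 5), (30_000, 5)]

def current_max_skip(iteration):
    """Return the maximum allowed frame jump at this training iteration."""
    max_skip = MAX_SKIP_SCHEDULE[0][1]
    for start, value in MAX_SKIP_SCHEDULE:
        if iteration >= start:
            max_skip = value
    return max_skip

def sample_frame_indices(num_video_frames, num_train_frames, iteration):
    """Pick `num_train_frames` ordered indices whose consecutive gaps are each
    drawn uniformly from [1, max_skip]; the cap is a maximum, so most sampled
    gaps are smaller than max_skip."""
    max_skip = current_max_skip(iteration)
    indices = [random.randrange(num_video_frames)]
    for _ in range(num_train_frames - 1):
        gap = random.randint(1, max_skip)
        indices.append(min(indices[-1] + gap, num_video_frames - 1))
    return indices
```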

hkchengrex avatar Oct 10 '24 17:10 hkchengrex

Thanks.

Can you also give some intuition on the second part: how does training on 8-frame videos, with at most 3 frames in memory, lead to good long-video segmentation capability during inference?

> Also, another question I had: since the number of frames used for training was 8, as mentioned in the paper, how is the model able to do well on long videos where the number of frames could be in the thousands, when it was never trained for such long-form memory usage? Even with a max jump of 15, the largest frame span in a single training video would be 15*8 = 120 frames.

abdksyed avatar Oct 12 '24 08:10 abdksyed

It generalizes. It is not unlike how CNNs generalize to different resolutions and how LLMs generalize to different sequence lengths with relative position embeddings. Learning a robust appearance representation (as queries/keys) is enough to go a long way. It might not be optimal -- but we didn't really have sufficiently long video datasets at the time.
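
One way to see why the memory size is not baked into the model: the readout is attention over however many memory entries exist, so the same learned queries/keys apply whether the memory holds 3 frames or thousands. A simplified STM-style readout (not XMem's exact similarity function or memory management) as a sketch:

```python
import torch

def memory_readout(query_key, memory_keys, memory_values):
    """Simplified space-time-memory-style readout. Shapes:
      query_key:     (C_k, H*W)  - key features of the current frame
      memory_keys:   (C_k, N)    - keys of N memorised entries
      memory_values: (C_v, N)    - values of the same N entries
    Nothing here depends on N, so a model trained with only a few memory
    frames can be run with far larger memories at inference time.
    """
    similarity = memory_keys.t() @ query_key      # (N, H*W)
    affinity = torch.softmax(similarity, dim=0)   # normalise over memory entries
    readout = memory_values @ affinity            # (C_v, H*W)
    return readout
```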

hkchengrex avatar Oct 13 '24 06:10 hkchengrex