Weird temporal sampling in implementation of Spatiotemporal Contrastive Video Representation Learning

Open DianeBouchacourt opened this issue 3 years ago • 0 comments

From reading the original paper, it seems that we should be using a linearly decreasing temporal distribution here (as you mention here: https://github.com/tensorflow/models/blob/5d3df060cf36850138c8e4683b6201dfc56c8eee/official/vision/beta/projects/video_ssl/dataloaders/video_ssl_input.py#L80).

However, I don't understand your implementation of the power CDF here: https://github.com/tensorflow/models/blob/5d3df060cf36850138c8e4683b6201dfc56c8eee/official/vision/beta/projects/video_ssl/ops/video_ssl_preprocess_ops.py#L343

For clarity, let max_offset = T and power = p. Following your cdf implementation, I get the CDF as

F(k) = -k^(p+1) / (p * T^(p+1)) + k * (p+1) / (p * T)

which gives the PDF as

P(k) = -k^p * (p+1) / (p * T^(p+1)) + (p+1) / (p * T) = (p+1) / (p * T) * (1 - k^p / T^p)
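For anyone who wants to check the algebra, here is a minimal sketch in plain Python (not the TensorFlow code from the repository). It re-types the CDF formula as I read it, so any transcription mistake is mine. The names `pdf`, `numeric_derivative` and `max_derivative_gap` are hypothetical helpers introduced only for this check. It compares a finite-difference derivative of the CDF with the PDF derived above.

```python
def cdf(k, max_offset, power=1.0):
    """CDF as transcribed in this issue (an assumption, not the repo code):
    F(k) = -k^(p+1) / (p * T^(p+1)) + k * (p+1) / (p * T)."""
    return (-k ** (power + 1) / (power * max_offset ** (power + 1))
            + k * (power + 1) / (power * max_offset))


def pdf(k, max_offset, power=1.0):
    """PDF derived above: P(k) = (p+1) / (p * T) * (1 - k^p / T^p)."""
    return (power + 1) / (power * max_offset) * (1 - (k / max_offset) ** power)


def numeric_derivative(f, k, h=1e-5):
    """Central finite-difference approximation of f'(k)."""
    return (f(k + h) - f(k - h)) / (2 * h)


def max_derivative_gap(max_offset, power, num_points=50):
    """Largest |dF/dk - P(k)| over points strictly inside (0, max_offset)."""
    ks = [max_offset * (i + 0.5) / num_points for i in range(num_points)]
    return max(
        abs(numeric_derivative(lambda x: cdf(x, max_offset, power), k)
            - pdf(k, max_offset, power))
        for k in ks
    )


if __name__ == "__main__":
    T = 32.0  # illustrative value for max_offset, not taken from the repo
    for p in (1.0, 2.0):
        print("power:", p)
        print("  F(0), F(T):", cdf(0.0, T, p), cdf(T, T, p))
        print("  max |dF/dk - P|:", max_derivative_gap(T, p))
```

The script also prints the CDF at the two endpoints, k = 0 and k = T, so the boundary values can be inspected alongside the derivative gap.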

This is definitely not x^power, which is what is stated here: https://github.com/tensorflow/models/blob/5d3df060cf36850138c8e4683b6201dfc56c8eee/official/vision/beta/projects/video_ssl/ops/video_ssl_preprocess_ops.py#L344

While I understand the intuition behind using a PDF of the form 1 - k^p / T^p, I don't get the term in front, (p+1) / (p * T). Can you elaborate?

DianeBouchacourt, Dec 01 '21 14:12