Weird temporal sampling in implementation of Spatiotemporal Contrastive Video Representation Learning
Reading the original paper, it seems we should be using a linearly decreasing temporal distribution here (as you mention at https://github.com/tensorflow/models/blob/5d3df060cf36850138c8e4683b6201dfc56c8eee/official/vision/beta/projects/video_ssl/dataloaders/video_ssl_input.py#L80).
However, I don't understand your implementation of the power CDF here: https://github.com/tensorflow/models/blob/5d3df060cf36850138c8e4683b6201dfc56c8eee/official/vision/beta/projects/video_ssl/ops/video_ssl_preprocess_ops.py#L343
Writing max_offset = T and power = p for clarity, and following your CDF implementation, I get the CDF F(k) = -k^(p+1) / (p T^(p+1)) + k (p+1) / (p T), which gives the PDF P(k) = -k^p (p+1) / (p T^(p+1)) + (p+1) / (p T) = (p+1) / (p T) * (1 - k^p / T^p).
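To double-check the derivation above, here is a quick numerical sketch (my own code, not the repo's) that differentiates the stated CDF with finite differences and compares it against the closed-form PDF:

```python
import numpy as np

# Illustrative values for max_offset and power (my own choice, not from the repo).
T = 16.0
p = 2.0

def cdf(k):
    # CDF as derived above: F(k) = -k^(p+1) / (p T^(p+1)) + k (p+1) / (p T)
    return -k**(p + 1) / (p * T**(p + 1)) + k * (p + 1) / (p * T)

def pdf(k):
    # Closed-form PDF: P(k) = (p+1) / (p T) * (1 - k^p / T^p)
    return (p + 1) / (p * T) * (1 - k**p / T**p)

k = np.linspace(1e-3, T - 1e-3, 1000)
eps = 1e-5
numeric_pdf = (cdf(k + eps) - cdf(k - eps)) / (2 * eps)

# The finite-difference derivative of F matches the closed-form PDF,
# and F runs from 0 to 1 over [0, T].
assert np.allclose(numeric_pdf, pdf(k), atol=1e-6)
assert abs(cdf(0.0)) < 1e-12 and abs(cdf(T) - 1.0) < 1e-9
print("derivation checks out")
```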
This is definitely not x^power, as stated here: https://github.com/tensorflow/models/blob/5d3df060cf36850138c8e4683b6201dfc56c8eee/official/vision/beta/projects/video_ssl/ops/video_ssl_preprocess_ops.py#L344
While I can see the intuition behind using a PDF proportional to 1 - k^p / T^p, I don't understand the factor in front. Can you elaborate?
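For reference, here is a minimal inverse-transform sampler for the CDF derived above (again my own sketch, not the repo's code), which makes it easy to histogram what that CDF actually samples; the resulting offset distribution is decreasing in k, not proportional to x^power:

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative values; p = 1 is the case a linearly decreasing density would correspond to.
T, p = 16.0, 1.0

def cdf(k):
    # CDF as derived above: F(k) = -k^(p+1) / (p T^(p+1)) + k (p+1) / (p T)
    return -k**(p + 1) / (p * T**(p + 1)) + k * (p + 1) / (p * T)

# Invert F numerically on a dense grid, then map uniform draws through the inverse.
grid = np.linspace(0.0, T, 10_001)
u = rng.uniform(size=100_000)
samples = np.interp(u, cdf(grid), grid)

# The histogram is highest near 0 and falls off toward T:
# the sampled density is decreasing, not growing like x^p.
hist, _ = np.histogram(samples, bins=4, range=(0.0, T))
print(hist)
assert hist[0] > hist[1] > hist[2] > hist[3]
```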