
t=0 visibility at higher resolution

bhack opened this issue 2 years ago • 7 comments

In my query I've tried to add some points at t=0 with a higher-resolution input (> 256x256). Why do I find that many points are discarded by the visibility check directly at t=0, which is de facto my ground truth?

Is there something in the classifier or in expected_distance strictly tied to the 256x256 training input size?

bhack avatar Sep 05 '23 18:09 bhack

Hi bhack,

Thanks for the observation. Here is some more information:

  1. The current model is trained at 256x256 resolution; the huber_loss and expected_distance are both defined at that resolution. If you look at our latest commit https://github.com/deepmind/tapnet/commit/e8cda0708d68915e90683a3d246cfb0ed095bdec, we explicitly standardize the default training setup to 256x256 resolution.
  2. During inference, our model treats the query frame as just another unknown frame and runs inference on it too. Hence it produces a prediction on the query frame as well, and it is possible that the model makes a wrong prediction that differs from the ground-truth query points.
  3. When your video resolution is higher than 256x256, the model runs in a pyramid fashion: it first infers the points at 256x256 resolution, then at progressively higher resolutions until it reaches your final video resolution. Each level uses the previous resolution's estimate as the initial location and iteratively updates it; these iterative updates may turn occlusion flags off (see the sketch after this list).
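
A rough sketch of the coarse-to-fine loop described in point 3; `run_level` is a hypothetical stand-in for the actual tapnet internals, and the resolution-doubling schedule is an assumption:

  # Minimal sketch of coarse-to-fine inference, assuming a hypothetical
  # run_level(video, query_points, size, init) callable that runs the
  # tracker at one square resolution, optionally warm-started.
  def pyramid_inference(video, query_points, run_level, train_size=256):
    full_size = max(video.shape[1:3])         # video: (frames, H, W, 3)
    sizes = [train_size]
    while sizes[-1] < full_size:              # e.g. 256 -> 512 -> 1024
      sizes.append(min(sizes[-1] * 2, full_size))
    tracks, occluded = None, None
    for prev, size in zip([None] + sizes[:-1], sizes):
      if tracks is not None:
        tracks = tracks * (size / prev)       # rescale coarse positions
      # Each level iteratively refines from the coarser initialization;
      # those refinement steps are what can flip occlusion flags off.
      tracks, occluded = run_level(video, query_points, size, init=tracks)
    return tracks, occluded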

One sanity check: how does the model predict the t=0 points at 256x256 resolution? Does it work most of the time?

yangyi02 avatar Sep 06 '23 10:09 yangyi02

> One sanity check: how does the model predict the t=0 points at 256x256 resolution? Does it work most of the time?

I haven't yet run the same sequence at 256x256, but I suspect it is fine. When I have a free slot I will test it.

> During inference, our model treats the query frame as just another unknown frame and runs inference on it too. Hence it produces a prediction on the query frame as well, and it is possible that the model makes a wrong prediction that differs from the ground-truth query points.

Yes, I understand; in this case the receptive-field difference/drift in the refinement pyramid, without retraining at higher resolution, could cause some problems with the "filters' context".

> The current model is trained at 256x256 resolution; the huber_loss and expected_distance are both defined at that resolution. If you look at our latest commit https://github.com/deepmind/tapnet/commit/e8cda0708d68915e90683a3d246cfb0ed095bdec, we explicitly standardize the default training setup to 256x256 resolution.

Yes, this point is the one I'm most interested in, regarding the visibility-check component. E.g. in the colab:

  # Unbatch the outputs and convert from device arrays to numpy.
  outputs = tree.map_structure(lambda x: np.array(x[0]), outputs)
  tracks, occlusions, expected_dist = outputs['tracks'], outputs['occlusion'], outputs['expected_dist']

Is there a trick to at least "scale back" expected_dist at inference time?

bhack avatar Sep 06 '23 11:09 bhack

You can just ignore expected_dist if you find it's being too conservative (or raise it to some power less than 1 to reduce its impact), or use the output from an earlier level of the pyramid (you might need a minor code change to make the model return this). The model is (approximately) estimating whether the result is within an 8-pixel threshold at every resolution; relative to the image size, this threshold gets smaller and smaller later in the pyramid. Thus later layers might be more conservative, especially if there are smaller textureless regions. I expect this to be somewhat application-dependent, though.
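
For concreteness, a minimal sketch of that suggestion, assuming the demo colab's usual postprocessing where visibility is thresholded from the occlusion and uncertainty logits; the `dist_power` knob is an illustration, not part of the released code:

  import numpy as np

  def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

  def postprocess_visibility(occlusions, expected_dist, dist_power=1.0):
    # A point counts as visible when it is both unoccluded and predicted
    # to be within the distance threshold. Using dist_power < 1 pushes
    # the uncertainty term toward 1, softening its veto; dist_power = 0
    # ignores expected_dist entirely.
    visible_prob = 1.0 - sigmoid(occlusions)
    certain_prob = (1.0 - sigmoid(expected_dist)) ** dist_power
    return visible_prob * certain_prob > 0.5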

It's hard to say more without seeing your specific videos.

cdoersch avatar Sep 06 '23 18:09 cdoersch

> The model is (approximately) estimating whether the result is within an 8-pixel threshold at every resolution; relative to the image size, this threshold gets smaller and smaller later in the pyramid. Thus later layers might be more conservative, especially if there are smaller textureless regions.

With the pre-trained checkpoint, if we increase the inference input resolution enough, since there aren't many pyramid levels (just 3?), it could also have the side effect of rapid coordinate jumps. But I don't know whether you have any ablation on the pyramidal gap between levels.

bhack avatar Sep 06 '23 18:09 bhack

Or are you talking only about the "pyramid" effect of the iterative approach in section 4, and not about the "feature" backbone pyramid?

> Extension to High-Resolution Videos: When running at a given level, we use the position estimate from the prior level, but we directly re-initialize the occlusion and uncertainty estimates from the track initialization, as we find the model otherwise tends to become overconfident. The final output is then the average output across all refinement resolutions.
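
A sketch of how that re-initialization and averaging might look inside the level loop; all names are hypothetical, following the quoted description, and per-level outputs are assumed to already be in a common coordinate frame:

  def refine_across_levels(levels, init_occ, init_dist):
    # levels: per-level refinement callables (hypothetical). Positions
    # carry over between levels, while occlusion and uncertainty restart
    # from the track initialization to avoid overconfidence.
    tracks = None
    outs = []
    for run_level in levels:
      tracks, occ, dist = run_level(
          init_tracks=tracks, init_occ=init_occ, init_dist=init_dist)
      outs.append((tracks, occ, dist))
    # The final output averages the per-level estimates.
    n = len(outs)
    return tuple(sum(vals) / n for vals in zip(*outs))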

bhack avatar Sep 07 '23 01:09 bhack

I also have an extra doubt about section 4. Since you work with iterative refinements at 2x per level, you are still locked into the "receptive field" of the 256x256 ResNet. How are you going to recover this gap?

bhack avatar Sep 07 '23 11:09 bhack

> I don't know whether you have any ablation on the pyramidal gap between levels

No such ablation currently.

"pyramid" effect in iterative approach at section 4

Yes, I was talking about the pyramid that's required for running at a higher resolution than the training resolution.

> How are you going to recover this gap?

The ResNet doesn't have a very large receptive field (it's similar to ResNet18, but with less striding). So in practice we don't have to do anything to deal with this; the PIPs iterations work fine at higher resolutions, provided you transform the coordinates correctly.
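
For reference, the coordinate transform mentioned here is just a per-axis rescale between resolutions; a minimal sketch (the helper name is mine):

  import numpy as np

  def rescale_tracks(tracks, src_hw, dst_hw):
    # tracks: (..., 2) array of (x, y) positions at the source resolution.
    scale = np.array([dst_hw[1] / src_hw[1],   # x scales with width
                      dst_hw[0] / src_hw[0]])  # y scales with height
    return tracks * scale

  # e.g. lift 256x256 estimates into a 512x896 frame:
  # tracks_hi = rescale_tracks(tracks_lo, (256, 256), (512, 896))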

cdoersch avatar Sep 13 '23 16:09 cdoersch

Closing due to inactivity; the question has been answered as far as I can tell.

cdoersch avatar Aug 09 '24 10:08 cdoersch