
F.upsample(per_frame_logits)

Open · ilkarman opened this issue 5 years ago

I was wondering what the difference is between this and the TensorFlow model, which requires the output to be upsampled? As I understand it, the number of frames is downsampled by a factor of 4 (from 64 frames to 16 predictions), and then `F.upsample` is used to bring this back up to 64 to match the labels (which are per-frame).

However, I don't see this in the TensorFlow code and was wondering if the up-sampling is just defined in the model directly?
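For reference, a minimal sketch of the temporal up-sampling being asked about, assuming a `(batch, num_classes, T)` logits tensor with hypothetical sizes (batch of 2, 400 classes, 16 temporal positions). Note that `F.upsample` is deprecated in recent PyTorch releases in favour of the equivalent `F.interpolate`:

```python
import torch
import torch.nn.functional as F

# Hypothetical per-frame logits from the I3D head: the temporal
# dimension has been reduced 64 -> 16 (downsample factor of 4).
per_frame_logits = torch.randn(2, 400, 16)  # (N, C, T')

# Linear interpolation along the temporal axis stretches the 16
# predictions back to 64 so they line up with per-frame labels.
# (F.upsample is the deprecated alias for F.interpolate.)
upsampled = F.interpolate(
    per_frame_logits, size=64, mode='linear', align_corners=False
)
print(upsampled.shape)  # torch.Size([2, 400, 64])
```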

ilkarman avatar Jul 18 '19 12:07 ilkarman

Oh, is it because in the TensorFlow model they only care about classifying the whole clip without any localisation, so they just average the logits (global average pooling):

```python
averaged_logits = tf.reduce_mean(logits, axis=1)
```

whereas here, for localisation, you want the logits at a per-frame level, so you return the full logits from the model.py file?

In which case, if you are training this on a custom dataset where you just want a classification loss, you would average (or max-pool) the per-frame logits over the temporal dimension before computing the loss, rather than upsampling them?
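A sketch of that clip-level reduction, assuming the same hypothetical `(batch, num_classes, T)` logits layout as above; `mean(dim=2)` is the PyTorch analogue of the `tf.reduce_mean(logits, axis=1)` call quoted earlier (the axes differ because of the channel ordering):

```python
import torch

# Hypothetical per-frame output from the PyTorch I3D model:
# batch of 2, 400 classes, 16 temporal positions.
per_frame_logits = torch.randn(2, 400, 16)

# Collapse the temporal dimension so one clip-level prediction
# remains per sample: equivalent to global average pooling.
clip_logits = per_frame_logits.mean(dim=2)             # (2, 400)

# Max pooling over time is a common alternative when the action
# occupies only part of the clip.
clip_logits_max = per_frame_logits.max(dim=2).values   # (2, 400)
```

Either reduced tensor can then be fed straight into a standard classification loss such as `torch.nn.CrossEntropyLoss`.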

ilkarman avatar Jul 25 '19 11:07 ilkarman