pytorch-i3d
F.upsample(per_frame_logits)
I was wondering what the difference is between this and the TensorFlow model that makes the output upsampling necessary here. As I understand it, the number of frames gets downsampled by a factor of 4 (from 64 frames to 16 predictions), and then F.upsample is used to bring this back up to 64 so it matches the labels (which are per-frame).
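A minimal sketch of the step described above, with hypothetical shapes (1 clip, 157 Charades classes, a temporal stride of 4 reducing 64 frames to 16 logit steps). Note that `F.interpolate` is the non-deprecated replacement for `F.upsample`:

```python
import torch
import torch.nn.functional as F

# Hypothetical per-frame logits after the network's temporal
# downsampling: (batch, classes, time) = (1, 157, 16).
per_frame_logits = torch.randn(1, 157, 16)

# Upsample the temporal axis back to 64 so the logits align with
# the per-frame labels (F.interpolate supersedes F.upsample).
upsampled = F.interpolate(per_frame_logits, size=64, mode='linear',
                          align_corners=False)
print(upsampled.shape)  # torch.Size([1, 157, 64])
```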
However, I don't see this in the TensorFlow code and was wondering whether the up-sampling is just defined directly in the model there?
Oh, is it because in the TensorFlow model they only care about classifying the whole clip, without any localisation, so they just average the logits (global average pooling):
averaged_logits = tf.reduce_mean(logits, axis=1)
Whereas here, for localisation, you want the logits at a per-frame level, so the full logits are returned from the model.py file.
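The PyTorch analogue of that `tf.reduce_mean` line, as a sketch. The axis differs because the TF code uses a (batch, time, classes) layout while the PyTorch model emits (batch, classes, time); shapes here are hypothetical:

```python
import torch

# Hypothetical per-frame logits: (batch, classes, time) = (2, 157, 16).
logits = torch.randn(2, 157, 16)

# Clip-level classification: average over the temporal axis.
# Equivalent to tf.reduce_mean(logits, axis=1) in (N, T, C) layout.
averaged_logits = logits.mean(dim=2)  # shape (2, 157)
```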
In which case, if you are training this on a custom dataset where you just want a classification loss, you would presumably average the logits over the temporal dimension instead of upsampling them, and train against a single clip-level label.
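A sketch of that classification-only training step, assuming one class label per clip and the model's (batch, classes, time) logit layout; all names and shapes are hypothetical:

```python
import torch
import torch.nn as nn

# Hypothetical model output and clip-level labels.
per_frame_logits = torch.randn(2, 157, 16, requires_grad=True)  # (N, C, T)
labels = torch.tensor([3, 41])  # one class index per clip

# Average over time to get clip-level logits, then apply a plain
# classification loss instead of the per-frame localisation loss.
clip_logits = per_frame_logits.mean(dim=2)  # (N, C)
loss = nn.functional.cross_entropy(clip_logits, labels)
loss.backward()
```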