VideoMoCo
How does the adversarial loss work precisely?
Backpropagation through a masking operation does not seem easy, because you are essentially applying a non-differentiable operation to a subset of frames. In this implementation, even though a list of continuous (float) values controls the set of discrete indices to mask out via a top-k selection, the question remains of how to route the backward gradient from the masked video to that list of values rather than through the source video itself (as would probably happen without a custom `backward` implementation). The crucial part of the code is https://github.com/tinapan-pt/VideoMoCo/blob/main/moco/builder.py#L92, where `indices` is derived from `list_out`, which itself is predicted by another neural network.
In the `backward` method, we can interpret `grad_output` as the derivative of the loss with respect to the postprocessed (masked) video, and the return value `grad_list` must be the derivative of the loss with respect to `list_out`, so that the aforementioned neural network can update its weights and eventually optimize the adversarial objective. While I understand that `grad_im` must remain `None` because of the desired gradient flow, my question is as follows: what is the purpose of summing `grad_output` over the channel and spatial dimensions `(C, H, W)` and assigning the result directly to `grad_list`?
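For concreteness, here is a minimal sketch of how I read that masking `Function` (the names `FrameDropout` and `num_mask` are my own placeholders rather than the actual code in builder.py, and I am assuming a `(B, C, T, H, W)` video layout):

```python
import torch
from torch.autograd import Function

class FrameDropout(Function):
    """My paraphrase of the masking op: drop the top-k scored frames in forward,
    and route the backward gradient to the score list instead of the video."""

    @staticmethod
    def forward(ctx, video, list_out, num_mask):
        # video: (B, C, T, H, W); list_out: (B, T) per-frame scores from the generator
        masked = video.clone()
        _, indices = torch.topk(list_out, num_mask, dim=1)  # frames to drop per sample
        mean_val = video.mean()                             # "masking" = replace with mean pixel value
        for b in range(video.size(0)):
            masked[b, :, indices[b]] = mean_val
        return masked

    @staticmethod
    def backward(ctx, grad_output):
        # grad_output: dLoss / d(masked video), shape (B, C, T, H, W).
        # Summing over the channel and spatial dims leaves one scalar per frame,
        # shape (B, T), which is handed to list_out; the source video gets no gradient.
        grad_list = grad_output.sum(dim=(1, 3, 4))
        return None, grad_list, None  # gradients for (video, list_out, num_mask)
```

So a call like `masked = FrameDropout.apply(video, list_out, num_mask)` gives the generator a per-frame scalar gradient even though the top-k selection itself is non-differentiable.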
Intuitively speaking, every value in `grad_output` says something like "if we were to make this pixel brighter, it would change the loss by this amount", but I am not sure there even exists a simple and accurate way of converting this information into a directly usable derivative for `list_out`. Right now, if we amortize all the steps (including the fact that masking a frame means replacing it with the mean pixel value), it reads as "if a frame of the masked video would have to be made brighter on average to increase the loss, the corresponding `grad_list` value will be positive, otherwise negative".
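To make that reading concrete, here is a toy check of the interpretation (the shapes and the placeholder loss are arbitrary; the point is only that the per-frame sum equals the derivative of the loss with respect to a uniform brightness shift of that frame):

```python
import torch

# Arbitrary shapes and a placeholder loss, purely for illustration.
B, C, T, H, W = 1, 3, 4, 8, 8
video = torch.randn(B, C, T, H, W, requires_grad=True)
loss = (video ** 2).mean()
loss.backward()

grad_output = video.grad                    # dLoss / d(video), shape (B, C, T, H, W)
per_frame = grad_output.sum(dim=(1, 3, 4))  # what grad_list would be, shape (B, T)

# Finite-difference check: brighten frame t uniformly by eps and see how the loss moves.
eps, t = 1e-4, 0
shifted = video.detach().clone()
shifted[:, :, t] += eps
fd = ((shifted ** 2).mean() - loss.detach()) / eps
print(per_frame[0, t].item(), fd.item())    # the two numbers should roughly agree
```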
I am wondering how this gradient manages to update `list_out`, and therefore the masking operation, in a correct way, because I am confused about what the brightness (i.e. average pixel value) of a frame has to do with its difficulty (or lack thereof), which according to the paper is presumably the real underlying metric you want to optimize for. Perhaps I am missing some piece of the puzzle in terms of how the gradient calculations here actually relate to the adversarial loss, so any insight would be appreciated. Thank you!