
How does the adversarial loss work precisely?

Open basilevh opened this issue 2 years ago • 0 comments

Backpropagation through the masking operation does not seem straightforward, because a non-differentiable operation is applied to a subset of frames. In this implementation, a list of continuous (float) values controls which discrete frame indices get masked out via a topk selection, but the question remains how the backward gradient is routed from the masked video to that list of values rather than through the source video itself (as would presumably happen without a custom backward implementation).

The crucial part of the code is https://github.com/tinapan-pt/VideoMoCo/blob/main/moco/builder.py#L92, where indices is derived from list_out, which is in turn predicted by another neural network. In the backward method, grad_output can be interpreted as the derivative of the loss with respect to the postprocessed (masked) video, and the return value grad_list must be the derivative of the loss with respect to list_out, so that the aforementioned network can update its weights and eventually optimize the adversarial objective.

While I understand that grad_im must remain None because of the desired gradient flow, my question is as follows: what is the purpose of summing grad_output along the spatial dimensions (C, H, W) and assigning the result directly to grad_list? Intuitively, every value in grad_output says something like "if we were to make this pixel brighter, the loss would change by this amount", but I am not sure a simple and accurate way even exists to convert that information into a directly usable derivative for list_out. Right now, if we amortize all the steps (including the fact that masking a frame means replacing it with the mean pixel value), the gradient reads as "if one frame of the masked video had to be made brighter on average to increase the loss, the corresponding grad_list value will be positive, otherwise negative".

I am wondering how this gradient manages to update list_out, and therefore the masking operation, in a correct way, because I am confused about what the brightness (i.e. the average pixel value) of a frame has to do with its difficulty (or lack thereof), which according to the paper is presumably the real underlying quantity you want to optimize for. Perhaps I am missing some piece of the puzzle in terms of how the gradient calculations here actually relate to the adversarial loss, so any insight would be appreciated. Thank you!
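For concreteness, here is a minimal sketch (my own paraphrase, not the repository's exact code) of the custom autograd Function behavior I am describing. The class name MaskFrames, the (B, C, T, H, W) tensor layout, and the num_drop argument are my own assumptions for illustration:

```python
import torch

class MaskFrames(torch.autograd.Function):
    """Hypothetical paraphrase of the masking op around moco/builder.py#L92.
    Frames selected by topk over list_out are replaced with the clip's mean
    pixel value; the backward routes a gradient to list_out only."""

    @staticmethod
    def forward(ctx, video, list_out, num_drop):
        # video: (B, C, T, H, W); list_out: (B, T) per-frame scores.
        B, C, T, H, W = video.shape
        _, drop_idx = torch.topk(list_out, num_drop, dim=1)  # frames to mask
        masked = video.clone()
        mean_pix = video.mean(dim=(1, 2, 3, 4), keepdim=True)  # per-clip mean
        for b in range(B):
            masked[b, :, drop_idx[b]] = mean_pix[b]
        return masked

    @staticmethod
    def backward(ctx, grad_output):
        # grad_output: dLoss/d(masked video), shape (B, C, T, H, W).
        # The part I am asking about: sum over (C, H, W) to get one scalar
        # per frame, and hand that to list_out; the source video gets None.
        grad_list = grad_output.sum(dim=(1, 3, 4))  # -> (B, T)
        grad_im = None
        return grad_im, grad_list, None  # gradients for (video, list_out, num_drop)


# Usage: masked = MaskFrames.apply(video, list_out, num_drop)
```

In this reading, my question is essentially whether the per-frame sum computed in backward is a meaningful gradient for list_out, given that it only reflects how the average pixel value of each masked frame affects the loss.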

basilevh · Sep 24 '22 20:09