Penalty Term Frobenius Norm Squared
def Frobenius(mat):
    size = mat.size()
    if len(size) == 3:  # batched matrix
        ret = (torch.sum(torch.sum((mat ** 2), 1), 2).squeeze() + 1e-10) ** 0.5
        return torch.sum(ret) / size[0]
    else:
        raise Exception('matrix for computing Frobenius norm should be with 3 dims')
In the code above, the Frobenius norm of the matrix is calculated as ret and averaged over the batch dimension. However, in the original paper the squared norm $\|AA^T - I\|_F^2$ is used as the penalty term. Is this intended, or does it not matter much? Thanks!
The Frobenius norm has a sqrt() operation, which is not strictly necessary if we are only optimizing the term. The difference is just a matter of speed, I think.
Another issue I run into with this code is that the first sum operation reduces the number of dimensions to 2, but the outer sum is then over the no-longer-existing dimension 2. So either the dimensions should be reversed:
ret = (torch.sum(torch.sum((mat ** 2), 2), 1).squeeze() + 1e-10) ** 0.5
or perhaps keepdim=True should be passed to sum?
Also, are you saying the sqrt can be removed as an optimization?
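For reference, here is a minimal sketch of the keepdim=True variant mentioned above (my own untested rewrite, not the repo's code); keeping the reduced dimension lets the outer sum still address dim 2, matching the original ordering:

import torch

def frobenius_keepdim(mat):
    # mat is assumed to be a batched matrix of shape (batch, n, m)
    ret = (torch.sum(torch.sum(mat ** 2, 1, keepdim=True), 2).squeeze() + 1e-10) ** 0.5
    return torch.sum(ret) / mat.size(0)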
Sorry for the late reply.
You are right, the code in the first post would raise a dimension mismatch error.
Yes, I think the sqrt could be removed to reduce computation, but I haven't benchmarked the difference.
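For later readers, here is a minimal sketch of the sqrt-free variant being discussed, i.e. penalizing the squared norm $\|AA^T - I\|_F^2$ as in the paper (the names and shapes below are illustrative assumptions, not the repo's code):

import torch

def frobenius_squared(mat):
    # mat is assumed to be a batched matrix, e.g. A A^T - I of shape (batch, r, r)
    assert mat.dim() == 3, 'matrix for computing the squared Frobenius norm should have 3 dims'
    # without the sqrt, the 1e-10 stabilizer is no longer needed
    return torch.sum(torch.sum(mat ** 2, 2), 1).mean()

# example usage with an attention matrix A of shape (batch, r, n):
# I = torch.eye(A.size(1), device=A.device).unsqueeze(0)
# penalty = frobenius_squared(torch.bmm(A, A.transpose(1, 2)) - I)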
In the appendix of your paper, you mention a method called batched_dot; could you explain it in more detail? Also, when you compute the relation $F_r$, you take an element-wise product of $F_h$ and $F_p$; why not use $M_h$ and $M_p$ directly? Could you also give the shape of each of the tensors $M_h$, $M_p$, $F_h$, $F_p$? Looking forward to your reply :)
The "batched_dot" is just the batched_dot() function in Theano.
$M_h$ and $M_p$ are of shape (u, r); $F_h$ and $F_p$ are of shape (h, r); where h is the number of hidden states in the $W_{fp}$ matrix.
Please refer to this part if you want to look into implementation details: https://github.com/hantek/SelfAttentiveSentEmbed/blob/master/util_layers.py#L353-L356
As for why we do not multiply $M_h$ and $M_p$ directly: we want the hidden state $F_r$ to represent the relation between the two given sentences. The "gated encoder" part is inspired by a model from vision, https://www.iro.umontreal.ca/~memisevr/pubs/pami_relational.pdf , and corresponds to the "factored gated autoencoder" in that paper. In short, $W_{fh}$ and $W_{fp}$ are the transformations needed so that $F_r$ depends only on the relative relation between the two embeddings.
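Not the author, but to make the shapes concrete, here is a minimal PyTorch sketch of that gated-encoder step as I read it. It assumes, because batched_dot is used, that $W_{fh}$ and $W_{fp}$ hold one u-by-h factor matrix per attention hop; the hop dimension is written first below (i.e. the transpose of the (u, r) / (h, r) convention above), and all sizes are just example values:

import torch

r, u, h = 30, 600, 150         # attention hops, hidden size, factor size (illustrative)
M_h = torch.randn(r, u)        # hypothesis embedding matrix, one row per hop
M_p = torch.randn(r, u)        # premise embedding matrix
W_fh = torch.randn(r, u, h)    # assumed: one u-by-h factor matrix per hop
W_fp = torch.randn(r, u, h)

# batched_dot over the hop dimension: row i of M_h times slice i of W_fh
F_h = torch.bmm(M_h.unsqueeze(1), W_fh).squeeze(1)   # (r, h)
F_p = torch.bmm(M_p.unsqueeze(1), W_fp).squeeze(1)   # (r, h)

# the element-wise product gates one factor with the other, so F_r encodes
# the relation between the two sentence embeddings
F_r = F_h * F_p                                      # (r, h)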
May I recommend:
def Frobenius(mat):
    assert len(mat.shape) == 3, 'matrix for computing Frobenius norm should be with 3 dims'
    return torch.sum(torch.sum(torch.sum(mat ** 2, 2), 1) ** 0.5) / mat.shape[0]
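If it helps, here is a quick sanity check of that version against torch.norm (using a random batch; note that the 1e-10 stabilizer from the original snippet is dropped here, so the gradient at an exactly-zero matrix is not covered):

import torch

# mean per-example Frobenius norm computed two ways should agree
mat = torch.randn(4, 10, 20)
expected = torch.norm(mat.reshape(4, -1), dim=1).mean()
assert torch.allclose(Frobenius(mat), expected, atol=1e-5)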