Penalty Term Frobenius Norm Squared
def Frobenius(mat):
    size = mat.size()
    if len(size) == 3:  # batched matrix
        ret = (torch.sum(torch.sum((mat ** 2), 1), 2).squeeze() + 1e-10) ** 0.5
        return torch.sum(ret) / size[0]
    else:
        raise Exception('matrix for computing Frobenius norm should be with 3 dims')
In the code above, the Frobenius norm of the matrix is calculated as ret and averaged over the batch dimension. However, in the original paper the squared norm $\|AA^T - I\|_F^2$ is used as the penalty term. Is this intended, or does it not matter much? Thanks!
The Frobenius norm has a sqrt() operation, which is not strictly necessary if we are only optimizing the term. The difference is just a matter of speed, I think.
Another issue I run into with this code is that the first sum operation reduces the number of dimensions to 2, but the outer sum is then over the no-longer-existing dimension 2. So either the dimensions should be reversed:
ret = (torch.sum(torch.sum((mat ** 2), 2), 1).squeeze() + 1e-10) ** 0.5
or perhaps keepdim=True should be passed to sum?
Also, are you saying the sqrt can be removed as an optimization?
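For reference, here is a minimal sketch of the keepdim=True variant mentioned above (my own untested rewrite, not the repo's code); keeping the reduced dimension lets the outer sum still address dim 2, matching the original ordering:

import torch

def frobenius_keepdim(mat):
    # mat is assumed to be a batched matrix of shape (batch, n, m)
    ret = (torch.sum(torch.sum(mat ** 2, 1, keepdim=True), 2).squeeze() + 1e-10) ** 0.5
    return torch.sum(ret) / mat.size(0)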
Sorry for the late reply.
You are right, the code in the first post would raise a dimension mismatch error.
Yes, I think the sqrt could be removed to reduce computation, but I haven't benchmarked the difference.
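For later readers, here is a minimal sketch of the sqrt-free variant being discussed, i.e. penalizing the squared norm $\|AA^T - I\|_F^2$ as in the paper (the names and shapes below are illustrative assumptions, not the repo's code):

import torch

def frobenius_squared(mat):
    # mat is assumed to be a batched matrix, e.g. A A^T - I of shape (batch, r, r)
    assert mat.dim() == 3, 'matrix for computing the squared Frobenius norm should have 3 dims'
    # without the sqrt, the 1e-10 stabilizer is no longer needed
    return torch.sum(torch.sum(mat ** 2, 2), 1).mean()

# example usage with an attention matrix A of shape (batch, r, n):
# I = torch.eye(A.size(1), device=A.device).unsqueeze(0)
# penalty = frobenius_squared(torch.bmm(A, A.transpose(1, 2)) - I)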
In the appendix of your paper, you mention a method called batched_dot; could you explain it in more detail? Also, when you compute the relation $F_r$, you take an element-wise product of $F_h$ and $F_p$; why not use $M_h$ and $M_p$ directly? Could you also give the shape of each of the tensors $M_h$, $M_p$, $F_h$, $F_p$? Looking forward to your reply :)
The "batched_dot" is just the batched_dot() function in Theano.
$M_h$ and $M_p$ are of shape (u, r); $F_h$ and $F_p$ are of shape (h, r); where h is the number of hidden states in the $W_{fp}$ matrix.
Please refer to this part if you want to look into implementation details: https://github.com/hantek/SelfAttentiveSentEmbed/blob/master/util_layers.py#L353-L356
As for why we do not multiply $M_h$ and $M_p$ directly: we want the hidden state $F_r$ to represent the relation between the two given sentences. The "gated encoder" part is inspired by a model from vision, https://www.iro.umontreal.ca/~memisevr/pubs/pami_relational.pdf , and corresponds to the "factored gated autoencoder" in that paper. In short, $W_{fh}$ and $W_{fp}$ are the transformations needed so that $F_r$ depends only on the relative relation between the two embeddings.
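Not the author, but to make the shapes concrete, here is a minimal PyTorch sketch of that gated-encoder step as I read it. It assumes, because batched_dot is used, that $W_{fh}$ and $W_{fp}$ hold one u-by-h factor matrix per attention hop; the hop dimension is written first below (i.e. the transpose of the (u, r) / (h, r) convention above), and all sizes are just example values:

import torch

r, u, h = 30, 600, 150         # attention hops, hidden size, factor size (illustrative)
M_h = torch.randn(r, u)        # hypothesis embedding matrix, one row per hop
M_p = torch.randn(r, u)        # premise embedding matrix
W_fh = torch.randn(r, u, h)    # assumed: one u-by-h factor matrix per hop
W_fp = torch.randn(r, u, h)

# batched_dot over the hop dimension: row i of M_h times slice i of W_fh
F_h = torch.bmm(M_h.unsqueeze(1), W_fh).squeeze(1)   # (r, h)
F_p = torch.bmm(M_p.unsqueeze(1), W_fp).squeeze(1)   # (r, h)

# the element-wise product gates one factor with the other, so F_r encodes
# the relation between the two sentence embeddings
F_r = F_h * F_p                                      # (r, h)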
May I recommend:
def Frobenius(mat):
    assert len(mat.shape) == 3, 'matrix for computing Frobenius norm should be with 3 dims'
    return torch.sum(torch.sum(torch.sum(mat ** 2, 2), 1) ** 0.5) / mat.shape[0]
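If it helps, here is a quick sanity check of that version against torch.norm (using a random batch; note that the 1e-10 stabilizer from the original snippet is dropped here, so the gradient at an exactly-zero matrix is not covered):

import torch

# mean per-example Frobenius norm computed two ways should agree
mat = torch.randn(4, 10, 20)
expected = torch.norm(mat.reshape(4, -1), dim=1).mean()
assert torch.allclose(Frobenius(mat), expected, atol=1e-5)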