qkv computation
In https://github.com/kimiyoung/transformer-xl/blob/master/pytorch/mem_transformer.py#L105 , I am wondering why only the key and value are layer-normalized, while the query is not. In other variants (such as RelMultiHeadAttn), the qkv computation is implemented by a single self.qkv_net layer.
```python
if self.pre_lnorm:
    ##### layer normalization
    c = self.layer_norm(c)

head_q = self.q_net(h)
head_k, head_v = torch.chunk(self.kv_net(c), 2, -1)
```
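For contrast, here is a minimal, self-contained sketch of the two `pre_lnorm` paths as I read them in the repo: `MultiHeadAttn` normalizes only the concatenated context `c` before `kv_net`, while the query comes from the raw `h`; the `RelMultiHeadAttn`-style variants push the whole normalized input through a single `qkv_net` and then slice the query back to the last `qlen` positions, so there the query is normalized too. The shapes (`d_model`, `qlen`, `mlen`, etc.) are made-up illustration values, not values from the repo.

```python
import torch
import torch.nn as nn

d_model, n_head, d_head = 8, 2, 4
mlen, qlen, bsz = 3, 5, 1

layer_norm = nn.LayerNorm(d_model)

h = torch.randn(qlen, bsz, d_model)    # current segment
mems = torch.randn(mlen, bsz, d_model) # cached memory from previous segments
c = torch.cat([mems, h], 0)            # memory-augmented context

# Path 1 (MultiHeadAttn, pre_lnorm=True): separate q_net / kv_net.
# Only c is normalized; the query is projected from the un-normalized h.
q_net = nn.Linear(d_model, n_head * d_head, bias=False)
kv_net = nn.Linear(d_model, 2 * n_head * d_head, bias=False)

head_q = q_net(h)
head_k, head_v = torch.chunk(kv_net(layer_norm(c)), 2, -1)

# Path 2 (RelMultiHeadAttn-style, pre_lnorm=True): a single qkv_net.
# The whole input is normalized before projection, so the query is
# also computed from the normalized tensor, then sliced to qlen.
qkv_net = nn.Linear(d_model, 3 * n_head * d_head, bias=False)

w_heads = qkv_net(layer_norm(c))
w_head_q, w_head_k, w_head_v = torch.chunk(w_heads, 3, dim=-1)
w_head_q = w_head_q[-qlen:]            # queries only for the new tokens
```

So the asymmetry seems specific to the split `q_net`/`kv_net` path; in the single-`qkv_net` path the layer norm is applied uniformly before all three projections.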