Synthesizer
Implementation of random synthesizer attention seems to be problematic
In `RandomAttention`, it seems like you're allocating a whole batch of random matrices for the synthetic random attention map:

```python
self.random_attn = torch.randn(batch_size, n_head, max_seq_len, max_seq_len, requires_grad=True)
```

I don't think this is right. I think it should be a single tensor of size `(n_head, max_seq_len, max_seq_len)`, expanded (repeated) `batch_size` times during the forward pass. Am I right? A sketch of what I mean is below.
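Here's a minimal sketch of the fix I have in mind (class and argument names loosely follow the repo; treat the details as illustrative, not a drop-in patch):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RandomAttention(nn.Module):
    """Random Synthesizer head: the attention map is learned directly
    and does not depend on the input tokens."""
    def __init__(self, n_head, max_seq_len):
        super().__init__()
        # One shared (n_head, L, L) map; nn.Parameter registers it with
        # the module so it is trained and saved with the model.
        self.random_attn = nn.Parameter(
            torch.randn(n_head, max_seq_len, max_seq_len))

    def forward(self, v):
        # v: (batch, n_head, seq_len, d_v)
        b, _, l, _ = v.shape
        # Broadcast the single map across the batch at forward time
        # instead of allocating a (batch_size, ...) tensor in __init__.
        attn = self.random_attn[:, :l, :l].unsqueeze(0).expand(b, -1, -1, -1)
        attn = F.softmax(attn, dim=-1)
        return torch.matmul(attn, v)
```

`expand` only creates a broadcast view, so this also avoids materializing `batch_size` copies of the map in memory.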
The same issue applies to the Factorized version :) (a factorized sketch follows as well).
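For the factorized variant, the same fix would look roughly like this, assuming the paper's formulation where the `(L, L)` map is the product of two learned low-rank factors (again, names and signatures are illustrative; same imports as above):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FactorizedRandomAttention(nn.Module):
    """Factorized Random Synthesizer head: the (L, L) map is reconstructed
    from two learned low-rank factors, again with no batch dimension."""
    def __init__(self, n_head, max_seq_len, k):
        super().__init__()
        # R1, R2: (n_head, L, k); R1 @ R2^T gives an (n_head, L, L) map
        # with only 2*L*k parameters per head instead of L*L.
        self.random_attn_1 = nn.Parameter(torch.randn(n_head, max_seq_len, k))
        self.random_attn_2 = nn.Parameter(torch.randn(n_head, max_seq_len, k))

    def forward(self, v):
        # v: (batch, n_head, seq_len, d_v)
        b, _, l, _ = v.shape
        # Reconstruct the per-head map, then broadcast over the batch.
        attn = torch.matmul(self.random_attn_1[:, :l],
                            self.random_attn_2[:, :l].transpose(-2, -1))
        attn = F.softmax(attn, dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
        return torch.matmul(attn, v)
```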
Agree with you. Also, `random_attn` isn't registered with the model via `nn.Parameter()`, so it never appears in `model.parameters()` and the optimizer will never update it. A quick check below illustrates this.
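To demonstrate the consequence (shapes here are arbitrary): a plain tensor attribute is invisible to the module, even with `requires_grad=True`, while `nn.Parameter` makes it part of the model's parameters and state dict:

```python
import torch
import torch.nn as nn

class Broken(nn.Module):
    def __init__(self):
        super().__init__()
        # Plain tensor attribute: has requires_grad, but the module
        # doesn't know about it.
        self.random_attn = torch.randn(4, 16, 16, requires_grad=True)

class Fixed(nn.Module):
    def __init__(self):
        super().__init__()
        # nn.Parameter registers the tensor with the module.
        self.random_attn = nn.Parameter(torch.randn(4, 16, 16))

print(len(list(Broken().parameters())))  # 0 -> optimizer never sees it
print(len(list(Fixed().parameters())))   # 1 -> trained and saved in state_dict
```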