Synthesizer
Implementation of random synthesizer attention seems to be problematic
In `RandomAttention`, it seems like you're allocating a whole batch of random matrices for the synthetic random attention map:

```python
self.random_attn = torch.randn(batch_size, n_head, max_seq_len, max_seq_len, requires_grad=True)
```

I don't think this is right. I think it should be a single tensor of size `(n_head, max_seq_len, max_seq_len)`, expanded (repeated) `batch_size` times during the forward pass. Am I right? A sketch of what I mean is below.
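Here's a minimal sketch of the fix I have in mind (class and argument names loosely follow the repo; treat the details as illustrative, not a drop-in patch):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RandomAttention(nn.Module):
    """Random Synthesizer head: the attention map is learned directly
    and does not depend on the input tokens."""
    def __init__(self, n_head, max_seq_len):
        super().__init__()
        # One shared (n_head, L, L) map; nn.Parameter registers it with
        # the module so it is trained and saved with the model.
        self.random_attn = nn.Parameter(
            torch.randn(n_head, max_seq_len, max_seq_len))

    def forward(self, v):
        # v: (batch, n_head, seq_len, d_v)
        b, _, l, _ = v.shape
        # Broadcast the single map across the batch at forward time
        # instead of allocating a (batch_size, ...) tensor in __init__.
        attn = self.random_attn[:, :l, :l].unsqueeze(0).expand(b, -1, -1, -1)
        attn = F.softmax(attn, dim=-1)
        return torch.matmul(attn, v)
```

`expand` only creates a broadcast view, so this also avoids materializing `batch_size` copies of the map in memory.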
The same issue applies to the Factorized version :) (a factorized sketch follows as well).
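For the factorized variant, the same fix would look roughly like this, assuming the paper's formulation where the `(L, L)` map is the product of two learned low-rank factors (again, names and signatures are illustrative; same imports as above):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FactorizedRandomAttention(nn.Module):
    """Factorized Random Synthesizer head: the (L, L) map is reconstructed
    from two learned low-rank factors, again with no batch dimension."""
    def __init__(self, n_head, max_seq_len, k):
        super().__init__()
        # R1, R2: (n_head, L, k); R1 @ R2^T gives an (n_head, L, L) map
        # with only 2*L*k parameters per head instead of L*L.
        self.random_attn_1 = nn.Parameter(torch.randn(n_head, max_seq_len, k))
        self.random_attn_2 = nn.Parameter(torch.randn(n_head, max_seq_len, k))

    def forward(self, v):
        # v: (batch, n_head, seq_len, d_v)
        b, _, l, _ = v.shape
        # Reconstruct the per-head map, then broadcast over the batch.
        attn = torch.matmul(self.random_attn_1[:, :l],
                            self.random_attn_2[:, :l].transpose(-2, -1))
        attn = F.softmax(attn, dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
        return torch.matmul(attn, v)
```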
Agree with you. Also, `random_attn` isn't registered with the model via `nn.Parameter()`, so it never appears in `model.parameters()` and the optimizer will never update it. A quick check below illustrates this.
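To demonstrate the consequence (shapes here are arbitrary): a plain tensor attribute is invisible to the module, even with `requires_grad=True`, while `nn.Parameter` makes it part of the model's parameters and state dict:

```python
import torch
import torch.nn as nn

class Broken(nn.Module):
    def __init__(self):
        super().__init__()
        # Plain tensor attribute: has requires_grad, but the module
        # doesn't know about it.
        self.random_attn = torch.randn(4, 16, 16, requires_grad=True)

class Fixed(nn.Module):
    def __init__(self):
        super().__init__()
        # nn.Parameter registers the tensor with the module.
        self.random_attn = nn.Parameter(torch.randn(4, 16, 16))

print(len(list(Broken().parameters())))  # 0 -> optimizer never sees it
print(len(list(Fixed().parameters())))   # 1 -> trained and saved in state_dict
```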