perceiver-pytorch
Perceiver IO mask
Hi, I'm studying the attention-mask code in your Perceiver IO implementation:
if exists(mask):
    mask = rearrange(mask, 'b ... -> b (...)')
    max_neg_value = -torch.finfo(sim.dtype).max
    mask = repeat(mask, 'b j -> (b h) () j', h = h)  # mask: (b * h, 1, j)
    sim.masked_fill_(~mask, max_neg_value)
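For reference, here is a minimal standalone sketch of what that snippet does, using plain torch/einops outside the library with made-up shapes: the mask only carries an input (key) axis, so after the repeat it is broadcast identically over every latent query. It hides padded input positions; it never distinguishes between latents.

import torch
from einops import rearrange, repeat

b, h, n, j, d = 2, 8, 4, 6, 16                       # batch, heads, num latents, input length, head dim (illustrative)
q = torch.randn(b * h, n, d)                         # latent queries, heads folded into the batch axis
k = torch.randn(b * h, j, d)                         # keys derived from the input sequence
sim = torch.einsum('b i d, b j d -> b i j', q, k)    # (b*h, n, j) attention logits

mask = torch.ones(b, j, dtype = torch.bool)
mask[:, -2:] = False                                 # pretend the last two input tokens are padding

max_neg_value = -torch.finfo(sim.dtype).max
mask = rearrange(mask, 'b ... -> b (...)')           # still (b, j)
mask = repeat(mask, 'b j -> (b h) () j', h = h)      # (b*h, 1, j): the same mask row for every latent
sim.masked_fill_(~mask, max_neg_value)

attn = sim.softmax(dim = -1)
print(attn[0, :, -2:].abs().max())                   # ~0: padded positions receive no attention weight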
In the encode step this is easy to understand: the mask controls whether each piece of input information gets mapped into the latent space. But in the decode part the logic of this code isn't explained. What does it mean to mask the latent array when it is mapped to the output array along the latent dimension?
@xesdiny the latents never need any masking at all https://github.com/lucidrains/perceiver-pytorch/blob/main/perceiver_pytorch/perceiver_io.py#L168
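Concretely, adapting the usage example from the README (the shapes and the padding cut-off below are made up): the only mask in the model is the one over the input sequence passed to forward(), and the decoder-query -> latent cross-attention runs without any mask, because every latent is a real, learned slot rather than padding.

import torch
from perceiver_pytorch import PerceiverIO

model = PerceiverIO(
    dim = 32,            # dimension of the input sequence
    queries_dim = 32,    # dimension of the decoder queries
    logits_dim = 100,    # dimension of the final logits
    depth = 6,
    num_latents = 256,
    latent_dim = 512
)

seq = torch.randn(1, 512, 32)
queries = torch.randn(1, 128, 32)

mask = torch.ones(1, 512, dtype = torch.bool)
mask[:, 400:] = False                                # e.g. everything from position 400 on is padding

logits = model(seq, mask = mask, queries = queries)  # (1, 128, 100); no mask on the latent -> output step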
Hi, not sure if I'm missing a key detail here, but from what I can see the mask in this implementation does not work like a normal transformer's.
The mask as applied here lets you control which latents get information from which parts of the input sequence (i.e. the mask is b x n_latents x src_seq_len).
To match the existing transformer notion of an attention mask (or at least the one used in autoregressive LMs), the mask would need to be b x trg_seq_len x src_seq_len. You would then need a separate set of latents for each unique row of the mask wrt the trg_seq_len.
Something like:

src_seq_len = 3
trg_seq_len = 5
data = torch.randn(batch_size, src_seq_len, features)
queries = torch.randn(batch_size, trg_seq_len, features)
mask = torch.tensor([
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 1, 0],
    [1, 1, 1],
], dtype = torch.bool)  # (trg_seq_len, src_seq_len)
x = repeat(self.latents, 'n d -> b trg_len n d', b = b, trg_len = trg_seq_len)
x = cross_attention(x, data, mask)  # b x trg_seq_len x latents x features
# we now have trg_seq_len sets of latents; each set has no information about
# the input positions it has been masked from
for layer in ...:
    # stuff
latents = self.decoder_cross_attn(queries, context = x)
Does this make sense, or am I missing something?
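For comparison, here is a minimal standalone sketch (plain tensors, nothing from the library, illustrative shapes) of the "normal transformer" style mask described above: the mask is b x trg_seq_len x src_seq_len and is applied directly to the query-key logits, so each output position attends to its own subset of the input with no latent bottleneck in between.

import torch

b, trg_len, src_len, d = 2, 5, 3, 16
q = torch.randn(b, trg_len, d)                       # one query per output position
k = torch.randn(b, src_len, d)
v = torch.randn(b, src_len, d)

mask = torch.tensor([
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 1, 0],
    [1, 1, 1],
], dtype = torch.bool)                               # (trg_len, src_len), same pattern as the example above
mask = mask.unsqueeze(0).expand(b, -1, -1)           # (b, trg_len, src_len)

sim = torch.einsum('b i d, b j d -> b i j', q, k) / d ** 0.5
sim = sim.masked_fill(~mask, -torch.finfo(sim.dtype).max)
out = torch.einsum('b i j, b j d -> b i d', sim.softmax(dim = -1), v)  # (b, trg_len, d)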