Has anyone had success visualizing attention weights for image and text tokens? I'm really interested in seeing why the model selects the tokens it does.
In the training loop we have:
```
imgs = imgs.to(device=args.device)
logits, target = self.model(imgs)
loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), target.reshape(-1))
loss.backward()
```
However, the output of the transformer is: ``` _,...
The cosine schedule calculates the number of tokens that are UNMASKED
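
To make the masked vs. unmasked distinction concrete, here is a minimal sketch. It assumes a MaskGIT-style cosine schedule, where gamma(r) = cos(pi/2 * r) is the fraction of tokens that are still masked at progress r in [0, 1]. The function names and the floor rounding are my own assumptions for illustration, not this repo's code.

```python
import math
from typing import Tuple


def cosine_mask_ratio(r: float) -> float:
    """Fraction of tokens still MASKED at progress r in [0, 1] (assumed convention)."""
    return math.cos(math.pi / 2 * r)


def split_token_counts(r: float, num_tokens: int) -> Tuple[int, int]:
    """Return (num_masked, num_unmasked) for progress r.

    Floor rounding is an assumption; implementations differ on floor vs. ceil.
    """
    num_masked = math.floor(cosine_mask_ratio(r) * num_tokens)
    # Clamp to guard against floating-point drift at the endpoints.
    num_masked = max(0, min(num_tokens, num_masked))
    return num_masked, num_tokens - num_masked


if __name__ == "__main__":
    for r in (0.0, 0.25, 0.5, 0.75, 1.0):
        masked, unmasked = split_token_counts(r, 256)
        print(f"r={r:.2f} masked={masked} unmasked={unmasked}")
```

If the value coming out of the schedule is used directly as the number of tokens to mask, but it actually equals `num_unmasked` here, the two counts are swapped and the amount of masking would grow rather than shrink as r increases.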