Ben Trevett
For evaluation (measuring the validation/test loss) we always have to generate exactly the same number of tokens as in the actual target sequence, because that is how we measure our...
This is because we have a target sequence, `trg`, of something like `[<sos>, A, B, C, <eos>]`. We want our decoder to predict what the next item in the predicted...
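For example, something along these lines (a toy sketch, with random tensors and made-up shapes standing in for real model outputs) shows how the loss lines each prediction up against the target token it should have produced:

```python
import torch
import torch.nn as nn

trg_len = 5        # e.g. [<sos>, A, B, C, <eos>]
batch_size = 2
output_dim = 10    # size of the target vocabulary

# the model produces one prediction per target token: [trg len, batch size, output dim]
output = torch.randn(trg_len, batch_size, output_dim)
# the target token indices: [trg len, batch size]
trg = torch.randint(0, output_dim, (trg_len, batch_size))

criterion = nn.CrossEntropyLoss()

# drop the first time-step (we never predict <sos>) and flatten,
# so prediction i is scored against target token i
loss = criterion(
    output[1:].view(-1, output_dim),  # [(trg len - 1) * batch size, output dim]
    trg[1:].view(-1),                 # [(trg len - 1) * batch size]
)
print(loss.item())
```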
Beam search is something I am planning to implement when I get the time.
Not sure I understand the question, sorry. Are you asking why we use `pack_padded_sequence` in notebook 4?
@yugaljain1999 We can try running some code to help us understand how packed sequences are batched.

```python
import torch
import torch.nn as nn

max_length = 10
batch_size = 3
emb_dim = ...
```
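To flesh that out a bit (the shapes and lengths below are just illustrative, not taken from the notebook), packing a padded batch records how many sequences are still "active" at each time-step in `batch_sizes`, which is how the RNN skips the pad tokens:

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

batch_size = 3
max_length = 5
emb_dim = 4
hid_dim = 8

# a padded batch of embedded sequences: [batch size, seq len, emb dim]
embedded = torch.randn(batch_size, max_length, emb_dim)
# the actual (unpadded) lengths of each sequence, sorted descending
lengths = torch.tensor([5, 3, 2])

packed = pack_padded_sequence(embedded, lengths, batch_first=True)
# how many sequences are still "alive" at each time-step
print(packed.batch_sizes)  # tensor([3, 3, 2, 1, 1])

rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
packed_output, hidden = rnn(packed)

# unpack back to a padded tensor; positions past each sequence's length are zeros
output, output_lengths = pad_packed_sequence(packed_output, batch_first=True)
print(output.shape)   # torch.Size([3, 5, 8])
print(hidden.shape)   # torch.Size([1, 3, 8])
```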
I feel like this is more of an implementation issue, or personal preference. The way I've structured the tutorials (and the way I think about these things) is that if...
When we have a sequence length of one, which we do when decoding, `output == hidden`, as `output` contains the hidden states from all time-steps and `hidden` is...
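A quick way to convince yourself (this is just a standalone toy GRU, not the tutorial's decoder):

```python
import torch
import torch.nn as nn

emb_dim = 4
hid_dim = 8
batch_size = 2

rnn = nn.GRU(emb_dim, hid_dim)

# a single decoding time-step: [seq len = 1, batch size, emb dim]
input = torch.randn(1, batch_size, emb_dim)
hidden = torch.zeros(1, batch_size, hid_dim)

output, hidden = rnn(input, hidden)

# output: [seq len, batch size, hid dim] - top-layer hidden state at every time-step
# hidden: [n layers, batch size, hid dim] - final hidden state of every layer
# with seq len = 1 and a single layer, these hold the same values
print(torch.equal(output, hidden))  # True
```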
Not sure how I messed this up. Will look into it further. Thanks for pointing it out.
You can do what is done in the `MultiHeadAttentionLayer` and split the `hid_dim` into multiple "heads", but as it stands you have to do elementwise operations. What are you trying...
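The splitting itself is just a reshape, something like this (toy shapes, mirroring what the `MultiHeadAttentionLayer` does with `view` and `permute`):

```python
import torch

batch_size = 2
seq_len = 5
hid_dim = 512
n_heads = 8
head_dim = hid_dim // n_heads

x = torch.randn(batch_size, seq_len, hid_dim)

# split hid_dim into n_heads "heads" of size head_dim:
# [batch size, seq len, hid dim] -> [batch size, n heads, seq len, head dim]
x_heads = x.view(batch_size, seq_len, n_heads, head_dim).permute(0, 2, 1, 3)
print(x_heads.shape)  # torch.Size([2, 8, 5, 64])

# merge the heads back together
x_merged = x_heads.permute(0, 2, 1, 3).contiguous().view(batch_size, seq_len, hid_dim)
print(torch.equal(x, x_merged))  # True
```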