x-transformers
Question about XTransformer
Dear Author,
I tried XTransformer for a machine translation task and got val loss == 0.0 at the very first epoch. I don't know where I went wrong. Please advise.
Here is how I init the model:
model = XTransformer(
    dim = 512,
    pad_value = 0,
    enc_num_tokens = INPUT_DIM,
    enc_depth = 4,
    enc_heads = 16,
    enc_max_seq_len = ENC_MAX_LEN,
    enc_attn_dropout = 0.1,
    enc_ff_dropout = 0.1,
    enc_attn_dim_head = 32,
    enc_emb_dropout = 0.1,
    dec_num_tokens = OUTPUT_DIM,
    dec_depth = 4,
    dec_heads = 16,
    dec_emb_dropout = 0.1,
    dec_max_seq_len = DEC_MAX_LEN,
    dec_attn_dropout = 0.1,
    dec_ff_dropout = 0.1,
    dec_attn_dim_head = 32,
    tie_token_emb = False  # tie embeddings of encoder and decoder
)
The above params worked with another transformer implementation, but I wanted to try XTransformer since you have added a lot of functionality to it.
Thanks a lot!
@yzhang-github-pub hmm, so i would first run the enc_dec_copy.py example as a sanity check
are you using masks in the encoder? the mask that x-transformers receives is expected to be True for positions to attend to and False for positions that should not be attended to
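For reference, here is a minimal sketch of that convention (the tensor values and the pad id of 0 are illustrative, not from the thread):

import torch

seq = torch.tensor([[5, 9, 3, 0, 0]])  # token ids, 0 used as padding
mask = seq != 0                        # True = attend, False = no attention
# mask: tensor([[ True,  True,  True, False, False]])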
First, I tested enc_dec_copy.py, and it works.
Second, I used pad masks for both source and target, and masks are made with this function:
def make_mask(input, pad_value):
    # input is a tensor of shape [batch, sequence len, embedding dim]
    # pad_value is 0
    mask = (input != pad_value)
    return mask
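A quick usage sketch of that helper, assuming input holds the raw token ids with shape [batch, sequence len] (note that applying the comparison to an embedded [batch, sequence len, embedding dim] tensor would yield a 3-D mask rather than the usual 2-D padding mask):

import torch

src = torch.tensor([[12, 4, 7, 0, 0],
                    [3, 8, 6, 0, 0]])   # token ids, 0 = pad
src_mask = make_mask(src, pad_value=0)  # bool tensor of shape [batch, sequence len]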
then the loss:

loss = model(src, tgt, src_mask=src_mask, tgt_mask=tgt_mask)
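For context, a minimal training-step sketch around that call (the optimizer and learning rate are illustrative; the src_mask/tgt_mask keyword names follow the snippet above and may differ between x-transformers versions):

import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # illustrative choice

model.train()
optimizer.zero_grad()
loss = model(src, tgt, src_mask=src_mask, tgt_mask=tgt_mask)  # XTransformer returns the loss directly
loss.backward()
optimizer.step()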
@yzhang-github-pub you can actually omit the target mask
what does your training loss look like? could you share your experiment?
I will re-run the training without the target mask to see if the results are any different, and I will share the training loss then. Regarding sharing the experiment, what would you like me to share? The training loop is similar to your example, and the data is handled by a torch DataLoader.
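For reference, a sketch of the kind of validation pass being described (val_loader, its batch layout, and the pad value of 0 are assumptions, since the actual loop was not shared):

import torch

model.eval()
total_loss, n_batches = 0.0, 0
with torch.no_grad():
    for src, tgt in val_loader:                    # assumed DataLoader yielding (src, tgt) token-id tensors
        src_mask = make_mask(src, pad_value=0)
        loss = model(src, tgt, src_mask=src_mask)  # target mask omitted, per the suggestion above
        total_loss += loss.item()
        n_batches += 1
print(f"val loss: {total_loss / n_batches:.4f}")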