
Question about Xtransformer

Open yzhang-github-pub opened this issue 3 years ago • 4 comments

Dear Author,

I tried XTransformer for a machine translation task, and got val loss == 0.0 at the very first epoch. I don't know where I went wrong. Please advise.

Here is how I init the model:

```python
model = XTransformer(
    dim = 512,
    pad_value = 0,
    enc_num_tokens = INPUT_DIM,
    enc_depth = 4,
    enc_heads = 16,
    enc_max_seq_len = ENC_MAX_LEN,
    enc_attn_dropout = 0.1,
    enc_ff_dropout = 0.1,
    enc_attn_dim_head = 32,
    enc_emb_dropout = 0.1,
    dec_num_tokens = OUTPUT_DIM,
    dec_depth = 4,
    dec_heads = 16,
    dec_emb_dropout = 0.1,
    dec_max_seq_len = DEC_MAX_LEN,
    dec_attn_dropout = 0.1,
    dec_ff_dropout = 0.1,
    dec_attn_dim_head = 32,
    tie_token_emb = False  # tie embeddings of encoder and decoder
)
```

The above params worked for another transformer implementation, but I wanted to try XTransformer since you have added a lot of functionality to it.

Thanks a lot!

yzhang-github-pub avatar Aug 08 '22 15:08 yzhang-github-pub

@yzhang-github-pub hmm, so i would first run the enc_dec_copy.py example as a sanity check

are you using masks in the encoder? the mask that x-transformers receives is expected to be True for positions to attend to and False for positions to mask out
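For illustration, a minimal sketch of that convention (the tensors and pad value here are hypothetical):

```python
import torch

PAD_VALUE = 0

# token ids, shape [batch, seq len]; 0 is the pad token
src = torch.tensor([[5, 7, 2, 0, 0],
                    [3, 9, 4, 8, 0]])

# True = attend to this position, False = padding to ignore
src_mask = src != PAD_VALUE  # bool tensor, shape [batch, seq len]
```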

lucidrains avatar Aug 08 '22 16:08 lucidrains

First, I tested enc_dec_copy.py, and it works.

Second, I used pad masks for both source and target; the masks are made with this function:

```python
def make_mask(input, pad_value):
    # input is a tensor of shape [batch, sequence len, and embedding dim]
    # pad_value is 0
    mask = (input != pad_value)
    return mask
```

then the loss:

```python
loss = model(src, tgt, src_mask=src_mask, tgt_mask=tgt_mask)
```
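One thing worth sanity-checking here, since the comment above mentions an embedding dim: an elementwise comparison preserves the input's shape, so make_mask only yields the [batch, sequence len] boolean mask the model expects when it is given 2-D token ids, not a 3-D embedded tensor. A hedged check:

```python
# sketch: src should be token ids, so the mask should come out 2-D
assert src_mask.dim() == 2           # [batch, sequence len]
assert src_mask.dtype == torch.bool  # True = attend, False = padding
```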

yzhang-github-pub avatar Aug 08 '22 17:08 yzhang-github-pub

@yzhang-github-pub you can actually omit the target mask
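i.e., something like the following (keyword name as used earlier in this thread; a sketch, not necessarily the current API):

```python
# target mask omitted, per the suggestion above
loss = model(src, tgt, src_mask = src_mask)
```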

what does your training loss look like? could you share your experiment?

lucidrains avatar Aug 08 '22 18:08 lucidrains

I will re-run the training without the target mask to see if the results are any different, and I will share the training loss then. Regarding sharing the experiment, what would you like me to share? The training loop is similar to your example, and the data is handled by a torch DataLoader.
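For reference, a minimal sketch of such a loop, reusing make_mask from above; the dataset, batch size, and learning rate here are hypothetical:

```python
import torch
from torch.utils.data import DataLoader

# hypothetical: train_dataset yields (src, tgt) pairs of padded token-id tensors
loader = DataLoader(train_dataset, batch_size = 32, shuffle = True)
optim = torch.optim.Adam(model.parameters(), lr = 3e-4)

for src, tgt in loader:
    src_mask = make_mask(src, pad_value = 0)
    loss = model(src, tgt, src_mask = src_mask)  # target mask omitted, as suggested
    loss.backward()
    optim.step()
    optim.zero_grad()
```

Watching the raw per-step training loss in a loop like this, rather than only the epoch-end validation loss, should make it clear whether the model is genuinely converging or the loss is collapsing to zero immediately.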

yzhang-github-pub avatar Aug 08 '22 19:08 yzhang-github-pub