minimal-llama
Does this training process not consider the decoder attention_mask?
I see that:
    def model_forward(model, inputs):
        h = inputs
        h = h.to(model.base_model.model.model.embed_tokens.weight.device)
        h = model.base_model.model.model.embed_tokens(h)
        for layer in model.base_model.model.model.layers:
            h = h.to(layer.input_layernorm.weight.device)
            # The decoder layer is called without an attention_mask here.
            h = layer(h)[0]
        h = h.to(model.base_model.model.model.norm.weight.device)
        h = model.base_model.model.model.norm(h)
        h = model.base_model.model.lm_head(h)
        return h
Doesn't this mean the output at each position is computed over the whole sequence, since no attention mask is passed to the decoder layers?
Maybe you need to add _prepare_decoder_attention_mask(h) and pass the resulting mask to each layer to avoid this...
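For illustration, something along these lines might work. This is only a rough sketch, assuming an older transformers LLaMA implementation whose decoder layers accept an additive attention_mask of shape (batch, 1, seq_len, seq_len); the name model_forward_with_mask and the inline mask construction are placeholders of mine, not the repo's code, and the exact signature of _prepare_decoder_attention_mask depends on the transformers version:

```python
import torch

def model_forward_with_mask(model, inputs):
    # Sketch only: same pipeline as model_forward, but with an explicit
    # causal mask passed to every decoder layer.
    base = model.base_model.model.model
    h = inputs.to(base.embed_tokens.weight.device)
    h = base.embed_tokens(h)

    bsz, seq_len, _ = h.shape
    # Additive causal mask: 0 where a position may attend, a large negative
    # value above the diagonal so position i cannot see positions j > i.
    mask = torch.full((seq_len, seq_len), torch.finfo(h.dtype).min, dtype=h.dtype)
    mask = torch.triu(mask, diagonal=1)
    mask = mask[None, None, :, :].expand(bsz, 1, seq_len, seq_len)

    for layer in base.layers:
        h = h.to(layer.input_layernorm.weight.device)
        # Older LlamaDecoderLayer.forward takes attention_mask as a keyword argument.
        h = layer(h, attention_mask=mask.to(h.device))[0]

    h = h.to(base.norm.weight.device)
    h = base.norm(h)
    return model.base_model.model.lm_head(h)
```

If the batch also contains padding, the padding mask would need to be merged into this causal mask as well, which is (as far as I know) what _prepare_decoder_attention_mask does internally in older transformers versions.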