Junjie Wang

24 comments by Junjie Wang

@pytorchbot merge -f "Failed test is not related to this PR and internal tests no related failure"

If they are all tensors, `scaled_dot_product_attention` should work as long as we pass in the correct sizes?

This is kind of OK because we have a trick for this: we only use the bias from rank 0 (local rank).

I suggest using MLE loss rather than sum.