Junjie Wang
Results: 24 comments of Junjie Wang
@pytorchbot merge -f "Failed test is not related to this PR and internal tests no related failure"
If they are all tensors, `scaled_dot_product_attention` should work as long as we pass in the correct sizes?
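To make "correct sizes" concrete, here is a rough pure-Python sketch of the unmasked, no-dropout computation with the default `1/sqrt(E)` scaling, restricted to 2-D inputs. `sdpa_reference` is a hypothetical helper written for illustration only, not the PyTorch implementation; it just shows which dimensions have to agree.

```python
import math


def sdpa_reference(query, key, value):
    """Illustrative softmax(Q K^T / sqrt(E)) V on nested lists.

    query: L x E, key: S x E, value: S x Ev  ->  output: L x Ev
    """
    embed = len(query[0])
    src_len = len(key)
    if len(key[0]) != embed:
        raise ValueError("query and key must share the embedding size E")
    if len(value) != src_len:
        raise ValueError("key and value must share the source length S")

    scale = 1.0 / math.sqrt(embed)
    out = []
    for q in query:
        # Scaled dot product of this query row against every key row.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in key]
        # Numerically stable softmax over the source dimension.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of value rows gives one output row of width Ev.
        out.append([
            sum(w * v[j] for w, v in zip(weights, value))
            for j in range(len(value[0]))
        ])
    return out


q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # L=3, E=2
k = [[1.0, 0.0], [0.0, 1.0]]              # S=2, E=2
v = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]    # S=2, Ev=3
out = sdpa_reference(q, k, v)             # 3 rows of 3 values (L x Ev)
```

The output takes its row count from the query and its width from the value, so only the embedding size (query vs. key) and the source length (key vs. value) need to match.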
This is kind of OK because we have a trick for this: we only use the bias from rank0 (local rank).
Suggest using MLE loss rather than sum.