torch.nn.LayerNorm mismatches in nightly.
Describe the bug: torch.nn.LayerNorm mismatches in nightly, but matches in 1.12.1.
Urgency: None.
System information: nightly torch, nightly onnxruntime.
To Reproduce
import io
import torch
from onnxruntime import InferenceSession, SessionOptions
model = torch.nn.LayerNorm([10, 10])
x = torch.randn(20, 5, 10, 10)
torch_out = model(x)
model_onnx = io.BytesIO()
torch.onnx.export(
    model.eval(),
    x,
    model_onnx,
    input_names=["input"],  # name the input explicitly so it matches the feed dict below
    opset_version=14,
)
sess = InferenceSession(model_onnx.getvalue(), SessionOptions(), providers=['CUDAExecutionProvider'])
ort_out = sess.run(None, {"input":x.numpy()})
torch.testing.assert_close([torch_out.detach().numpy()], ort_out, rtol=1e-3, atol=1e-7)
Mismatched elements: 9974 / 10000 (99.7%)
Greatest absolute difference: 1.6739110946655273 at index (5, 4, 2, 0) (up to 1e-07 allowed)
Greatest relative difference: 14660.064374461228 at index (12, 1, 8, 7) (up to 0.001 allowed)
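One way to narrow this down (a minimal sketch, not part of the original report; it reuses model_onnx, x, and torch_out from the repro above) is to run the same exported model on the CPU execution provider as well and compare both providers against PyTorch:
import numpy as np
from onnxruntime import InferenceSession, SessionOptions

# Reuses model_onnx, x, and torch_out from the repro above.
reference = torch_out.detach().numpy()
for provider in ["CPUExecutionProvider", "CUDAExecutionProvider"]:
    sess = InferenceSession(model_onnx.getvalue(), SessionOptions(), providers=[provider])
    out = sess.run(None, {"input": x.numpy()})[0]
    # If only the CUDA provider diverges, the issue is in the CUDA kernels or in a
    # graph optimization that only the CUDA path triggers.
    print(provider, "max abs diff:", np.abs(out - reference).max())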
@wangyems @tianleiwu for comments
It does not reproduce on my machine. I used the latest nightly versions on Python 3.8 and Ubuntu 18.04:
PyTorch: 1.13.0.dev20220830+cu113 from pip3 install --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/cu113
ort-nightly-gpu: 1.13.0.dev20220830001
Thanks for taking a look. I think I might have mixed up the CUDA versions between these two. I will check again this week.
I kept the exported ONNX model and ran it with both onnxruntime==1.12.1 and nightly onnxruntime (built with CUDA 11.6): layer_norm.zip
The results don't align.
This is how I built it:
./build.sh --config RelWithDebInfo --enable_training --use_cuda --cuda_home /usr/local/cuda/ --cudnn_home /usr/local/cuda/ --build_wheel --parallel --skip_tests --cmake_extra_defines CMAKE_CUDA_ARCHITECTURES=70 --cuda_version=11.6
Note that opset 17 is aligned, since it uses the newly supported LayerNormalization op, but opsets 9-16 are all mismatched (they are decomposed into ReduceMean/Div nodes).
cc @justinchuby if you have more insight to provide.
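To confirm the decomposition described above, here is a small sketch (assuming the model and x from the repro) that exports at two opsets and lists the operators the exporter emits:
import io
import onnx
import torch

def exported_ops(opset_version):
    buf = io.BytesIO()
    torch.onnx.export(model.eval(), x, buf, input_names=["input"], opset_version=opset_version)
    graph = onnx.load_model_from_string(buf.getvalue()).graph
    return sorted({node.op_type for node in graph.node})

print(14, exported_ops(14))  # expected: ReduceMean/Sub/Pow/Div/... decomposition
print(17, exported_ops(17))  # expected to include LayerNormalization (needs a recent torch)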
Possibly caused by --enable_training. In my test, I did not enable training.
Indeed, without --enable_training the issue goes away. Is the mismatch with --enable_training expected behavior? Why should there be a difference at all?
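One way to see what the two builds actually run (a sketch, assuming the model_onnx buffer from the repro; the output path is arbitrary) is to dump the graph after onnxruntime's own optimizations from each build and diff the results:
from onnxruntime import InferenceSession, SessionOptions

so = SessionOptions()
so.optimized_model_filepath = "layer_norm_optimized.onnx"  # arbitrary output path
InferenceSession(model_onnx.getvalue(), so, providers=["CUDAExecutionProvider"])
# Inspect the dumped model (e.g. in Netron) from the --enable_training build and from a
# stock build; a difference in the applied fusions (such as a LayerNormalization fusion)
# would be one possible source of the numerical mismatch.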
@titaiwangms is this problem resolved? I don't think the mismatch is expected when enable_training is turned on. Are you still seeing this mismatch?