Correct error tolerances for golden values on A100
Issue body
After #9975, error tolerances for shark on A100 have been exceeded for a few models. Here are some numbers:
self = <xlm-roberta-base_tf_test.XLMRobertaModuleTester object at 0x7fe155272770>, dynamic = False, device = 'gpu'
E Mismatched elements: 252531 / 4000032 (6.31%)
E Max absolute difference: 0.10531139
E Max relative difference: 865.5326
E x: array([[[-4.401309, -0.024628, -7.125814, ..., 4.503648, -4.59222 ,
E -1.076694],
E [-2.240759, 0.21017 , -8.47337 , ..., -2.105228, -1.818338,...
E y: array([[[-4.394317, -0.024287, -7.125648, ..., 4.478123, -4.585316,
E -1.076995],
E [-2.24629 , 0.210042, -8.475931, ..., -2.110665, -1.816817,...
self = <roberta-base_tf_test.RobertaBaseModuleTester object at 0x7fe155308f70>, dynamic = False, device = 'gpu'
E Not equal to tolerance rtol=0.01, atol=0.001
E
E Mismatched elements: 46624 / 804240 (5.8%)
E Max absolute difference: 0.04533577
E Max relative difference: 763.70135
E x: array([[[33.55235 , -3.827327, 18.863625, ..., 3.420343, 6.171632,
E 11.648125],
E [-0.598835, -4.141003, 14.904708, ..., -4.515923, -1.790529,...
E y: array([[[33.567413, -3.829913, 18.870962, ..., 3.422938, 6.174327,
E 11.656706],
E [-0.58585 , -4.141752, 14.913631, ..., -4.516505, -1.788759,...
self = <mpnet-base_tf_test.MpNetModuleTester object at 0x7fe15525a080>, dynamic = False, device = 'gpu'
E AssertionError:
E Not equal to tolerance rtol=0.01, atol=0.001
E
E Mismatched elements: 59378 / 488432 (12.2%)
E Max absolute difference: 0.06668603
E Max relative difference: 3304.545
E x: array([[[40.389954, 4.286406, 23.76233 , ..., -1.074989, -0.482307,
E 16.880697],
E [ 2.257942, 0.504233, 8.199037, ..., -1.836042, 0.471555,...
E y: array([[[40.38317 , 4.290402, 23.760578, ..., -1.071989, -0.476889,
E 16.869303],
E [ 2.256348, 0.50376 , 8.193238, ..., -1.834158, 0.474861,...
self = <mobilebert-uncased_tf_test.MobileBertModuleTester object at 0x7fe15525b340>, dynamic = False, device = 'gpu'
E AssertionError:
E Not equal to tolerance rtol=0.01, atol=0.001
E
E Mismatched elements: 99072 / 488352 (20.3%)
E Max absolute difference: 0.2849064
E Max relative difference: 570.024
E x: array([[[-4.563648, -8.917149, -9.508633, ..., -8.859805, -9.35775 ,
E -3.739411],
E [-8.470783, -8.042081, -7.747127, ..., -7.734895, -8.48076 ,...
E y: array([[[-4.561851, -8.916107, -9.508212, ..., -8.85981 , -9.357377,
E -3.738622],
E [-8.470868, -8.044136, -7.747543, ..., -7.735366, -8.477797,...
self = <layoutlm-base-uncased_tf_test.LayoutLMModuleTester object at 0x7fe1552b6ec0>, dynamic = False, device = 'gpu'
E AssertionError:
E Not equal to tolerance rtol=0.01, atol=0.001
E
E Mismatched elements: 145352 / 488352 (29.8%)
E Max absolute difference: 0.0553565
E Max relative difference: 2522.9336
E x: array([[[-0.424161, 1.658019, 0.9119 , ..., 0.691548, 0.414469,
E 0.90081 ],
E [-0.761064, -0.302433, -1.195132, ..., -0.884939, 0.444821,...
E y: array([[[-0.41647 , 1.662008, 0.920087, ..., 0.697769, 0.41865 ,
E 0.905736],
E [-0.751545, -0.297387, -1.189691, ..., -0.874244, 0.443483,...
self = <electra-small-discriminator_tf_test.ElectraModuleTester object at 0x7fe15525abc0>, dynamic = False, device = 'gpu'
E AssertionError:
E Not equal to tolerance rtol=0.01, atol=0.001
E
E Mismatched elements: 58884 / 488352 (12.1%)
E Max absolute difference: 0.01450959
E Max relative difference: 756.9981
E x: array([[[ 1.150137e+00, 1.647311e-01, 1.618423e-01, ...,
E 1.635987e-01, 1.645508e-01, 1.536248e-01],
E [-2.518778e-02, 2.517256e-01, 2.526046e-01, ...,...
E y: array([[[ 1.151324, 0.167032, 0.164161, ..., 0.165891, 0.166873,
E 0.155909],
E [-0.027828, 0.250528, 0.251408, ..., 0.253964, 0.253663,...
Is this expected? Should we revise the tolerances?
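For reference, the statistics in the failures above follow the numpy assertion semantics: an element mismatches when its absolute difference exceeds `atol + rtol * |expected|`. A minimal sketch with hypothetical values (the real tests compare millions of model-output elements):

```python
import numpy as np

# Hypothetical actual (x) and golden (y) values standing in for model outputs.
x = np.array([1.00, 2.000, 0.30])
y = np.array([1.05, 2.001, 0.30])

# Equivalent to np.testing.assert_allclose(x, y, rtol=0.01, atol=0.001):
# an element mismatches when |x - y| > atol + rtol * |y|.
abs_diff = np.abs(x - y)
mismatched = abs_diff > 0.001 + 0.01 * np.abs(y)

print(int(mismatched.sum()))  # number of mismatched elements -> 1
```

Note that the "Max relative difference" can be huge (e.g. 3304.545 above) even for small absolute errors, because it is dominated by elements near zero.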
Can we disable TF32 for both IREE and TF (if the CUDA backend is used as reference) and see if this goes away? I don't think there is much we can do if we want to use TF32.
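On the TF side, the experiment of disabling TF32 can be done with a one-line config toggle (a sketch; the IREE side would need its own flag, which is not shown here):

```python
import tensorflow as tf

# Force full FP32 matmuls/convolutions on Ampere GPUs (A100) so the
# CUDA reference values are not computed with TF32's reduced mantissa.
tf.config.experimental.enable_tensor_float_32_execution(False)
```

If the mismatches disappear with this set, that would confirm TF32 rounding as the source of the tolerance failures.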
@ThomasRaoux @KoolJBlack didn't get a chance to review this in today's sync. Do we have a priority for this?
This was discussed on the chat. There is not much we can do within IREE as long as we want to use TF32.
@dan-garvey can you update the bug with your plan?
@dan-garvey Can you provide an update today?
Yeah, we're going to relax tolerance when TF32 is enabled. Any changes would be on the NVIDIA side, so nothing required from the IREE side afaik. Thanks for the support!
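A sketch of what relaxing the tolerance could look like (the flag name and widened values here are illustrative, not the actual shark test code):

```python
import numpy as np

# Hypothetical switch; in practice this would be detected from the
# backend/device configuration (TF32 is the default on A100).
TF32_ENABLED = True

# Keep the strict tolerances for exact-FP32 backends, widen them under TF32.
rtol, atol = (0.1, 0.01) if TF32_ENABLED else (0.01, 0.001)

# Example values taken from the roberta-base failure above.
x = np.array([33.55235, -3.827327])
y = np.array([33.567413, -3.829913])
np.testing.assert_allclose(x, y, rtol=rtol, atol=atol)  # passes with relaxed tolerance
```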
Great! Can we close this?