
Correct error tolerances for golden values on A100

Open dan-garvey opened this issue 3 years ago • 4 comments

Issue body

After #9975, the error tolerances for SHARK on A100 are exceeded for a few models. Here are some numbers:

self = <xlm-roberta-base_tf_test.XLMRobertaModuleTester object at 0x7fe155272770>, dynamic = False, device = 'gpu'                                                                                                
E       Mismatched elements: 252531 / 4000032 (6.31%)
E       Max absolute difference: 0.10531139
E       Max relative difference: 865.5326
E        x: array([[[-4.401309, -0.024628, -7.125814, ...,  4.503648, -4.59222 ,
E                -1.076694],
E               [-2.240759,  0.21017 , -8.47337 , ..., -2.105228, -1.818338,...
E        y: array([[[-4.394317, -0.024287, -7.125648, ...,  4.478123, -4.585316,
E                -1.076995],
E               [-2.24629 ,  0.210042, -8.475931, ..., -2.110665, -1.816817,...  
self = <roberta-base_tf_test.RobertaBaseModuleTester object at 0x7fe155308f70>, dynamic = False, device = 'gpu'
E       Not equal to tolerance rtol=0.01, atol=0.001
E
E       Mismatched elements: 46624 / 804240 (5.8%)
E       Max absolute difference: 0.04533577
E       Max relative difference: 763.70135
E        x: array([[[33.55235 , -3.827327, 18.863625, ...,  3.420343,  6.171632,
E                11.648125],
E               [-0.598835, -4.141003, 14.904708, ..., -4.515923, -1.790529,...
E        y: array([[[33.567413, -3.829913, 18.870962, ...,  3.422938,  6.174327,
E                11.656706],
E               [-0.58585 , -4.141752, 14.913631, ..., -4.516505, -1.788759,... 
self = <mpnet-base_tf_test.MpNetModuleTester object at 0x7fe15525a080>, dynamic = False, device = 'gpu' 
E       AssertionError:
E       Not equal to tolerance rtol=0.01, atol=0.001
E
E       Mismatched elements: 59378 / 488432 (12.2%)
E       Max absolute difference: 0.06668603
E       Max relative difference: 3304.545
E        x: array([[[40.389954,  4.286406, 23.76233 , ..., -1.074989, -0.482307,
E                16.880697],
E               [ 2.257942,  0.504233,  8.199037, ..., -1.836042,  0.471555,...
E        y: array([[[40.38317 ,  4.290402, 23.760578, ..., -1.071989, -0.476889,
E                16.869303],
E               [ 2.256348,  0.50376 ,  8.193238, ..., -1.834158,  0.474861,...    


self = <mobilebert-uncased_tf_test.MobileBertModuleTester object at 0x7fe15525b340>, dynamic = False, device = 'gpu'  
E       AssertionError:
E       Not equal to tolerance rtol=0.01, atol=0.001
E
E       Mismatched elements: 99072 / 488352 (20.3%)
E       Max absolute difference: 0.2849064
E       Max relative difference: 570.024
E        x: array([[[-4.563648, -8.917149, -9.508633, ..., -8.859805, -9.35775 ,
E                -3.739411],
E               [-8.470783, -8.042081, -7.747127, ..., -7.734895, -8.48076 ,...
E        y: array([[[-4.561851, -8.916107, -9.508212, ..., -8.85981 , -9.357377,
E                -3.738622],
E               [-8.470868, -8.044136, -7.747543, ..., -7.735366, -8.477797,...   
self = <layoutlm-base-uncased_tf_test.LayoutLMModuleTester object at 0x7fe1552b6ec0>, dynamic = False, device = 'gpu'
E       AssertionError:
E       Not equal to tolerance rtol=0.01, atol=0.001
E
E       Mismatched elements: 145352 / 488352 (29.8%)
E       Max absolute difference: 0.0553565
E       Max relative difference: 2522.9336
E        x: array([[[-0.424161,  1.658019,  0.9119  , ...,  0.691548,  0.414469,
E                 0.90081 ],
E               [-0.761064, -0.302433, -1.195132, ..., -0.884939,  0.444821,...
E        y: array([[[-0.41647 ,  1.662008,  0.920087, ...,  0.697769,  0.41865 ,
E                 0.905736],
E               [-0.751545, -0.297387, -1.189691, ..., -0.874244,  0.443483,...   
self = <electra-small-discriminator_tf_test.ElectraModuleTester object at 0x7fe15525abc0>, dynamic = False, device = 'gpu'
E       AssertionError:
E       Not equal to tolerance rtol=0.01, atol=0.001
E
E       Mismatched elements: 58884 / 488352 (12.1%)
E       Max absolute difference: 0.01450959
E       Max relative difference: 756.9981
E        x: array([[[ 1.150137e+00,  1.647311e-01,  1.618423e-01, ...,
E                 1.635987e-01,  1.645508e-01,  1.536248e-01],
E               [-2.518778e-02,  2.517256e-01,  2.526046e-01, ...,...
E        y: array([[[ 1.151324,  0.167032,  0.164161, ...,  0.165891,  0.166873,
E                 0.155909],
E               [-0.027828,  0.250528,  0.251408, ...,  0.253964,  0.253663,...   

Is this expected? Should we revise the tolerances?

dan-garvey avatar Aug 09 '22 19:08 dan-garvey
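
For readers of the logs above: `numpy.testing.assert_allclose` accepts an element when `|x - y| <= atol + rtol * |y|`, where `y` is the reference value. The sketch below applies that rule in plain Python to a few `(x, y)` pairs copied from the electra and layoutlm logs. The helper names are made up for illustration, and the near-zero pair at the end is a hypothetical value chosen only to show why "Max relative difference" can reach the hundreds or thousands even when absolute differences are small.

```python
def within_tolerance(actual, desired, rtol=0.01, atol=0.001):
    """Element-wise acceptance rule used by numpy.testing.assert_allclose."""
    return abs(actual - desired) <= atol + rtol * abs(desired)


def relative_difference(actual, desired):
    """|actual - desired| / |desired|, as reported in the 'Max relative difference' line."""
    return abs(actual - desired) / abs(desired)


# (x, y) pairs copied from the logs above.
pairs = [
    (0.1647311, 0.167032),      # electra: difference fits inside atol + rtol*|y|
    (-0.02518778, -0.027828),   # electra: small |y|, so the budget is almost only atol
    (-0.424161, -0.41647),      # layoutlm
]
for x, y in pairs:
    print(x, y, within_tolerance(x, y), f"rel={relative_difference(x, y):.4f}")

# Hypothetical near-zero reference value (not taken from the logs):
# the absolute error is only ~0.005, but the relative difference is roughly 499.
print(relative_difference(5e-3, 1e-5))
```

The relative-difference figure is dominated by elements whose reference value is close to zero, so the mismatch percentage and the max absolute difference are usually the more informative numbers when deciding on new tolerances.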

Can we disable TF32 for both IREE and TF (if the CUDA backend is used as the reference) and see if this goes away? I don't think there is much we can do if we want to use TF32.

ThomasRaoux avatar Aug 09 '22 20:08 ThomasRaoux
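
As background for the TF32 suggestion: TF32 keeps float32's 8 exponent bits but only 10 explicit mantissa bits, so each matmul input carries a relative rounding error of up to about 2^-11. The following is a rough CPU-side sketch of that effect, not how IREE or TensorFlow implement it. The `to_tf32` helper is hypothetical, and its rounding mode (round half up on the magnitude) is an assumption made for simplicity.

```python
import random
import struct


def to_tf32(value):
    """Round a float to TF32 precision (10 explicit mantissa bits).

    Assumption: round-half-up on the magnitude; NaN, inf, and values
    outside the float32 range are not handled.
    """
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    bits = (bits + 0x1000) & 0xFFFFE000  # drop the low 13 of float32's 23 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def dot(a, b):
    """Dot product accumulated in Python floats (double precision)."""
    return sum(x * y for x, y in zip(a, b))


random.seed(0)
a = [random.uniform(-1, 1) for _ in range(768)]
b = [random.uniform(-1, 1) for _ in range(768)]

exact = dot(a, b)
approx = dot([to_tf32(x) for x in a], [to_tf32(y) for y in b])
print(f"exact={exact:.6f} tf32_inputs={approx:.6f} abs_diff={abs(exact - approx):.2e}")
```

A single dot product shows only a small difference; in a transformer the rounding is applied at every matmul, so the differences can compound across layers.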

@ThomasRaoux @KoolJBlack didn't get a chance to review this in today's sync. Do we have a priority for this?

allieculp avatar Aug 11 '22 18:08 allieculp

This was discussed in the chat. There is not much we can do within IREE as long as we want to use TF32.

ThomasRaoux avatar Aug 11 '22 18:08 ThomasRaoux

@dan-garvey can you update the bug with your plan?

ThomasRaoux avatar Aug 11 '22 18:08 ThomasRaoux

@dan-garvey Can you provide an update today?

allieculp avatar Aug 15 '22 19:08 allieculp

Yeah, we're going to relax tolerance when TF32 is enabled. Any changes would be on the NVIDIA side, so nothing required from the IREE side afaik. Thanks for the support!

dan-garvey avatar Aug 15 '22 19:08 dan-garvey
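
One way the "relax tolerance when TF32 is enabled" plan could look in a test harness is sketched below. This is only an illustration: `Tolerance`, `pick_tolerance`, and the `tf32_enabled` flag are hypothetical names, and the relaxed values are placeholders rather than the numbers SHARK adopted. The default values are the `rtol=0.01, atol=0.001` shown in the failing assertions.

```python
from typing import NamedTuple


class Tolerance(NamedTuple):
    rtol: float
    atol: float


# Values taken from the failing assertions in this issue.
DEFAULT = Tolerance(rtol=1e-2, atol=1e-3)
# Placeholder values only; they would need tuning against the observed differences.
TF32_RELAXED = Tolerance(rtol=1e-1, atol=1e-1)


def pick_tolerance(device, tf32_enabled):
    """Return the relaxed tolerance only for GPU runs with TF32 enabled."""
    if device == "gpu" and tf32_enabled:
        return TF32_RELAXED
    return DEFAULT


tol = pick_tolerance("gpu", tf32_enabled=True)
print(tol)
```

Keeping the strict default for CPU runs and for GPU runs without TF32 means a genuine regression on those paths would still be caught.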

Great! Can we close this?

allieculp avatar Aug 15 '22 20:08 allieculp