SHARK TF roberta/XLM roberta numerics issues on A100 if num_iterations >= 100

XLM-roberta assert failure:

>       np.testing.assert_allclose(golden_out, result, rtol=1e-01, atol=1e-02)
E       AssertionError: 
E       Not equal to tolerance rtol=0.1, atol=0.01
E       
E       Mismatched elements: 5505 / 4000032 (0.138%)
E       Max absolute difference: 0.09074688
E       Max relative difference: 3171.7234
E        x: array([[[ 2.683771,  0.183121, 10.453473, ...,  6.315439,  2.047505,
E                 3.32532 ],
E               [-0.482143,  0.061366,  9.494564, ...,  6.593861,  1.620899,...
E        y: array([[[ 2.671124,  0.182537, 10.456981, ...,  6.322483,  2.051546,
E                 3.322179],
E               [-0.481575,  0.061454,  9.495419, ...,  6.59101 ,  1.619549,...

roberta-base-tf assert failure:

>       np.testing.assert_allclose(golden_out, result, rtol=1e-01, atol=1e-02)
E       AssertionError: 
E       Not equal to tolerance rtol=0.1, atol=0.01
E       
E       Mismatched elements: 453 / 804240 (0.0563%)
E       Max absolute difference: 0.04533577
E       Max relative difference: 763.70135
E        x: array([[[33.55235 , -3.827327, 18.863625, ...,  3.420343,  6.171632,
E                11.648125],
E               [-0.598835, -4.141003, 14.904708, ..., -4.515923, -1.790529,...
E        y: array([[[33.567413, -3.829913, 18.870962, ...,  3.422938,  6.174327,
E                11.656706],
E               [-0.58585 , -4.141752, 14.913631, ..., -4.516505, -1.788759,...
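For reading the two failures above: np.testing.assert_allclose treats an element as mismatched when |actual - desired| > atol + rtol * |desired|. With rtol=0.1 and atol=0.01, and a max absolute difference of about 0.09 (XLM-roberta) or 0.045 (roberta-base), every mismatched element must have |desired| below roughly 0.81 or 0.35 respectively, which is consistent with the very large max relative differences. A minimal sketch of that check follows; the helper name and the sample values are illustrative and are not taken from the failing run.

```python
import numpy as np


def allclose_mismatches(actual, desired, rtol=1e-1, atol=1e-2):
    """Boolean mask of elements that would fail the assert_allclose criterion.

    Hypothetical helper for illustration; it mirrors the documented test
    |actual - desired| <= atol + rtol * |desired|.
    """
    actual = np.asarray(actual, dtype=np.float64)
    desired = np.asarray(desired, dtype=np.float64)
    return np.abs(actual - desired) > atol + rtol * np.abs(desired)


# Illustrative values only (not from the test logs).
desired = np.array([10.0, 0.00001, 2.0])
actual = np.array([10.5, 0.03, 2.0])

mask = allclose_mismatches(actual, desired)
# The large element absorbs a 0.5 error through rtol, while the
# near-zero element fails on an error of only about 0.03.
print(mask)
```

Applying a mask like this to golden_out and result would show whether the mismatches are concentrated in near-zero outputs.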

To reproduce:

On an A100 instance:

Remove the xfail for the GPU case in tank/roberta-base_tf/roberta-base_tf_test.py
Remove the xfail for the GPU case in tank/xlm-roberta-base_tf/xlm-roberta-base_tf.py
Run: pytest tank/*roberta -k "gpu"

Aug 16 '22 23:08 monorimet

Until the patch is merged, check out the branch ean-bench to reproduce.

Aug 16 '22 23:08 monorimet

Perhaps the solution will be to keep the default shark_args.num_iterations = 1 and increase it only for benchmarks.

Aug 17 '22 00:08 monorimet
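One way the suggestion above could look, as a rough sketch only: the Args class is a hypothetical stand-in for shark_args (only num_iterations is modeled), and benchmark_iterations is an illustrative helper, not an existing SHARK API.

```python
from contextlib import contextmanager
from dataclasses import dataclass


@dataclass
class Args:
    """Hypothetical stand-in for shark_args; only num_iterations is modeled."""
    num_iterations: int = 1  # default stays at 1 for numerics tests


@contextmanager
def benchmark_iterations(args, n):
    """Temporarily raise args.num_iterations and restore it on exit."""
    previous = args.num_iterations
    args.num_iterations = n
    try:
        yield args
    finally:
        args.num_iterations = previous


args = Args()
with benchmark_iterations(args, 100):
    inside = args.num_iterations  # raised only while benchmarking
after = args.num_iterations       # restored for the numerics checks
```

Scoping the higher iteration count to the benchmark path would keep the assert_allclose tests running with a single iteration.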

This issue is no longer relevant; closing.

Feb 01 '23 22:02 monorimet

TF roberta/XLM roberta numerics issues on A100 if num_iterations >= 100