TF roberta/XLM roberta numerics issues on A100 if num_iterations >= 100
XLM-roberta assert failure:
> np.testing.assert_allclose(golden_out, result, rtol=1e-01, atol=1e-02)
E AssertionError:
E Not equal to tolerance rtol=0.1, atol=0.01
E
E Mismatched elements: 5505 / 4000032 (0.138%)
E Max absolute difference: 0.09074688
E Max relative difference: 3171.7234
E x: array([[[ 2.683771, 0.183121, 10.453473, ..., 6.315439, 2.047505,
E 3.32532 ],
E [-0.482143, 0.061366, 9.494564, ..., 6.593861, 1.620899,...
E y: array([[[ 2.671124, 0.182537, 10.456981, ..., 6.322483, 2.051546,
E 3.322179],
E [-0.481575, 0.061454, 9.495419, ..., 6.59101 , 1.619549,...
roberta-base-tf assert failure:
> np.testing.assert_allclose(golden_out, result, rtol=1e-01, atol=1e-02)
E AssertionError:
E Not equal to tolerance rtol=0.1, atol=0.01
E
E Mismatched elements: 453 / 804240 (0.0563%)
E Max absolute difference: 0.04533577
E Max relative difference: 763.70135
E x: array([[[33.55235 , -3.827327, 18.863625, ..., 3.420343, 6.171632,
E 11.648125],
E [-0.598835, -4.141003, 14.904708, ..., -4.515923, -1.790529,...
E y: array([[[33.567413, -3.829913, 18.870962, ..., 3.422938, 6.174327,
E 11.656706],
E [-0.58585 , -4.141752, 14.913631, ..., -4.516505, -1.788759,...
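For context on how the tolerance check decides what counts as a mismatch: np.testing.assert_allclose(x, y, rtol, atol) flags element i when |x[i] - y[i]| > atol + rtol * |y[i]|, and the reported "Max relative difference" is |x - y| / |y|, so a handful of near-zero elements in the SHARK output can blow up the relative figure even though the absolute error stays small. A minimal sketch with made-up numbers (not taken from the failing runs above):

```python
import numpy as np

# Made-up values that mimic the pattern above: one element whose computed
# counterpart is near zero while the reference is not.
golden = np.array([3.2e-2, 2.671124, 10.456981])   # x: golden_out
result = np.array([1.0e-5, 2.683771, 10.453473])   # y: SHARK result

rtol, atol = 1e-01, 1e-02
abs_diff = np.abs(golden - result)

# Element i is "mismatched" when |x[i] - y[i]| > atol + rtol * |y[i]|.
mismatched = abs_diff > atol + rtol * np.abs(result)

print("Mismatched elements:", mismatched.sum(), "/", mismatched.size)  # 1 / 3
print("Max absolute difference:", abs_diff.max())                      # ~0.032
print("Max relative difference:", (abs_diff / np.abs(result)).max())   # ~3199
```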
To reproduce:
On an A100 instance (until the patch is merged, check out the ean-bench branch first):
- remove the xfail for the gpu case in tank/roberta-base_tf/roberta-base_tf_test.py (a sketch of the kind of marker to remove follows below)
- remove the xfail for the gpu case in tank/xlm-roberta-base_tf/xlm-roberta-base_tf.py
- run: pytest tank/*roberta -k "gpu"
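The exact layout of the tank test files may differ, but removing the xfail amounts to deleting or commenting out a pytest xfail mark on the gpu parameter. A hypothetical sketch (the test name, parameter list, and reason string are illustrative, not the real tank code):

```python
import pytest

@pytest.mark.parametrize(
    "device",
    [
        "cpu",
        pytest.param(
            "gpu",
            # marks=pytest.mark.xfail(reason="numerics mismatch on A100"),
            # ^ remove or comment out this mark so the gpu case actually runs
        ),
    ],
)
def test_roberta_base_tf(device):
    ...  # placeholder for the real tank test body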
Perhaps the solution is to keep the default shark_args.num_iterations = 1 and increase it only for benchmarks, roughly as sketched below.
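A rough sketch of that idea, assuming shark_args is the module-level namespace from shark.parser and exposes a mutable num_iterations attribute; the run_benchmark helper and shark_module.forward call are hypothetical, not the actual SHARK API:

```python
# Sketch only: import path and benchmark helper are assumptions.
from shark.parser import shark_args

# Correctness tests run the module a single time, so iteration-to-iteration
# numeric drift on the A100 never enters the tolerance comparison.
shark_args.num_iterations = 1


def run_benchmark(shark_module, inputs, iterations=100):
    """Raise the iteration count only for timing runs, then restore it."""
    previous = shark_args.num_iterations
    shark_args.num_iterations = iterations
    try:
        for _ in range(shark_args.num_iterations):
            shark_module.forward(inputs)
    finally:
        shark_args.num_iterations = previous
```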
This issue is no longer relevant; closing.