DeepSpeed Evoformer attention bfloat16 consistency issue
We have observed a few issues with the DeepSpeed Evoformer attention kernel when using bfloat16. In particular:
- The unit test `TestDeepSpeedKernel.test_compare_model` in `tests/test_deepspeed_evo_attention.py` fails 30-50% of the time when the precision is set to `torch.bfloat16`. This inconsistency has been observed by other users as well: https://github.com/aqlaboratory/openfold/issues/403#issuecomment-2464460708
- The DeepSpeed unit test `tests/unit/ops/deepspeed4science/test_DS4Sci_EvoformerAttention.py` fails ~40% of the time when both a `bias` term and `bfloat16` are used: https://github.com/deepspeedai/DeepSpeed/issues/5052
It's unclear at this time whether specific attention initializations or input values are triggering the failures. More investigation is needed.
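One factor worth keeping in mind when investigating: bfloat16 keeps only 8 mantissa bits, so element-wise deviations from a float32 reference on the order of 0.1-1% are expected even from a correct kernel, and a comparison tolerance tuned for float32 will flake. The stdlib-only sketch below (a hypothetical helper, not OpenFold or DeepSpeed code) emulates the float32-to-bfloat16 truncating cast to show the magnitude of error the test tolerances have to absorb:

```python
import struct

def to_bfloat16(x: float) -> float:
    """Round a Python float to bfloat16 precision by truncating the
    16 low-order bits of its float32 encoding (the cheap cast variant)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

# With only 8 mantissa bits, the relative rounding error can approach
# 2**-8 (~0.4%), versus ~1e-7 for float32.
x = 3.14159
err = abs(to_bfloat16(x) - x) / x
print(f"bfloat16 relative error for pi: {err:.2e}")
```

If the comparison in the failing tests uses absolute/relative tolerances tighter than this intrinsic rounding error, intermittent failures would be expected purely from unlucky input draws, independent of any kernel bug.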