
Deepspeed evoformer attention bfloat16 consistency issue

Open jnwei opened this issue 7 months ago • 0 comments

We have observed a few issues with the DeepSpeed Evoformer attention kernel when running in bfloat16. In particular:

  • The unit test TestDeepSpeedKernel.test_compare_model in tests/test_deepspeed_evo_attention.py fails 30-50% of the time when the precision is set to torch.bfloat16. This inconsistency has been observed by other users as well: https://github.com/aqlaboratory/openfold/issues/403#issuecomment-2464460708
  • The DeepSpeed unit test tests/unit/ops/deepspeed4science/test_DS4Sci_EvoformerAttention.py fails ~40% of the time when a bias term is present and bfloat16 is used: https://github.com/deepspeedai/DeepSpeed/issues/5052

It's unclear at this time whether specific attention initializations or input values are causing this issue. More investigation is needed.
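One way to start narrowing this down might be a seed sweep that records which random initializations exceed the tolerance, so failing cases can be replayed. The sketch below is only an illustration of that idea and does not call the DeepSpeed kernel: `to_bfloat16`, `attention_row`, `max_abs_error`, and `sweep` are hypothetical helpers, the attention is a toy pure-Python softmax over one query row, and bfloat16 is emulated by rounding after every arithmetic step (real kernels typically accumulate in higher precision, so the toy exaggerates the error).

```python
import math
import random
import struct


def to_bfloat16(x: float) -> float:
    """Emulate bfloat16: round a float32 value to its top 16 bits (ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias for the dropped low bits
    bits &= 0xFFFF0000                   # keep sign, exponent, 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def attention_row(q, keys, values, bias, cast):
    """Toy softmax attention for one query row; `cast` rounds each intermediate."""
    scale = cast(1.0 / math.sqrt(len(q)))
    logits = []
    for k, b in zip(keys, bias):
        acc = 0.0
        for qi, ki in zip(q, k):
            acc = cast(acc + cast(qi * ki))
        logits.append(cast(cast(acc * scale) + b))
    top = max(logits)
    weights = [cast(math.exp(l - top)) for l in logits]
    total = 0.0
    for w in weights:
        total = cast(total + w)
    out = []
    for j in range(len(values[0])):
        acc = 0.0
        for w, v in zip(weights, values):
            acc = cast(acc + cast(cast(w / total) * v[j]))
        out.append(acc)
    return out


def max_abs_error(seed, n_keys=16, dim=8, bias_scale=1.0):
    """Max |full precision - emulated bfloat16| for inputs drawn from `seed`."""
    rng = random.Random(seed)
    q = [rng.gauss(0, 1) for _ in range(dim)]
    keys = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_keys)]
    values = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_keys)]
    bias = [rng.gauss(0, bias_scale) for _ in range(n_keys)]
    ref = attention_row(q, keys, values, bias, cast=float)
    low = attention_row(
        [to_bfloat16(x) for x in q],
        [[to_bfloat16(x) for x in k] for k in keys],
        [[to_bfloat16(x) for x in v] for v in values],
        [to_bfloat16(x) for x in bias],
        cast=to_bfloat16,
    )
    return max(abs(a - b) for a, b in zip(ref, low))


def sweep(seeds, tol):
    """Return per-seed errors and the sorted list of seeds that exceed `tol`."""
    errors = {s: max_abs_error(s) for s in seeds}
    failing = sorted(s for s, e in errors.items() if e > tol)
    return errors, failing


if __name__ == "__main__":
    errors, failing = sweep(range(50), tol=0.05)  # tolerance chosen arbitrarily
    print(f"{len(failing)}/{len(errors)} seeds exceed tolerance: {failing}")
```

To apply the same pattern to the real tests, `max_abs_error` would be swapped for a function that seeds torch, runs the kernel and the reference attention, and returns their maximum absolute difference. Comparing the failing seeds with `bias_scale=0.0` against the default might also help check whether the bias term matters.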

jnwei, Apr 25 '25 20:04