DeepSpeed Evoformer attention bfloat16 consistency issue
We have observed a few issues with the DeepSpeed Evoformer attention kernel when using bfloat16. In particular:
- The unit test `TestDeepSpeedKernel.test_compare_model` in `tests/test_deepspeed_evo_attention.py` fails 30-50% of the time when the precision is set to `torch.bfloat16`. This inconsistency has been observed by other users as well: https://github.com/aqlaboratory/openfold/issues/403#issuecomment-2464460708
- The DeepSpeed unit test `tests/unit/ops/deepspeed4science/test_DS4Sci_EvoformerAttention.py` fails ~40% of the time when both a `bias` term and `bfloat16` are used: https://github.com/deepspeedai/DeepSpeed/issues/5052
It's unclear at this time whether specific attention initializations or input values are triggering the failures. More investigation is needed.
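One factor worth keeping in mind when investigating: bfloat16 keeps only 8 mantissa bits, so element-wise deviations from a float32 reference on the order of 0.1-1% are expected even from a correct kernel, and a comparison tolerance tuned for float32 will flake. The stdlib-only sketch below (a hypothetical helper, not OpenFold or DeepSpeed code) emulates the float32-to-bfloat16 truncating cast to show the magnitude of error the test tolerances have to absorb:

```python
import struct

def to_bfloat16(x: float) -> float:
    """Round a Python float to bfloat16 precision by truncating the
    16 low-order bits of its float32 encoding (the cheap cast variant)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

# With only 8 mantissa bits, the relative rounding error can approach
# 2**-8 (~0.4%), versus ~1e-7 for float32.
x = 3.14159
err = abs(to_bfloat16(x) - x) / x
print(f"bfloat16 relative error for pi: {err:.2e}")
```

If the comparison in the failing tests uses absolute/relative tolerances tighter than this intrinsic rounding error, intermittent failures would be expected purely from unlucky input draws, independent of any kernel bug.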