openfold
About version of zero_to_fp32.py
Hi, I'm training openfold on a single node with multiple GPUs and on multiple nodes with multiple GPUs, and I found something interesting.
In the environment.yml file, the deepspeed version is pinned as deepspeed==0.5.3. But the zero_to_fp32.py file in the scripts folder comes from a newer version (it contains the function _get_fp32_state_dict_from_zero2_checkpoint). This mismatch makes it impossible to load weights from the ckpt.
Also, I'm new to DeepSpeed. When I use a single node with multiple GPUs, I can load the model with the alphafold2 weights and add noise, then run inference with the openfold model trained on a small dataset, and the outputs are good. But with multiple nodes and multiple GPUs, when I load the model with the alphafold2 weights and add noise, the inference output is almost all zeros. Do you know why?
You should be safe updating deepspeed to the newest version. As I recall, that explicit version pin isn't necessary. I'll remove it later.
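If it helps to confirm which deepspeed is actually installed before and after upgrading, here is a minimal sketch using only the Python standard library. It is not part of openfold or DeepSpeed: the helper names (parse_version, installed_version, is_newer_than_pin) are hypothetical, and the only value taken from this thread is the 0.5.3 pin.

```python
from importlib import metadata

# The pin reported in environment.yml in this thread.
PINNED_VERSION = "0.5.3"


def parse_version(text):
    """Turn a release string such as '0.5.3' into a comparable tuple (0, 5, 3).

    Hypothetical helper: it keeps only the leading digits of each dotted
    piece, so suffixes like '+cu111' or 'rc1' are ignored.
    """
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def installed_version(package="deepspeed"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def is_newer_than_pin(version, pin=PINNED_VERSION):
    """True if `version` is strictly newer than the pinned release."""
    return parse_version(version) > parse_version(pin)


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("deepspeed is not installed in this environment")
    elif is_newer_than_pin(found):
        print(f"deepspeed {found} is newer than the {PINNED_VERSION} pin")
    else:
        print(f"deepspeed {found} is at or below the {PINNED_VERSION} pin")
```

Running this in the training environment on each node shows whether all nodes ended up on the same release after the upgrade.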
As for the second issue, see #88. There's some weirdness surrounding distributed DeepSpeed checkpoints, and though I haven't gotten to the bottom of the issue, I propose a workaround there.