Forward order differs across ranks: rank 0 is all-gathering 1 parameters while rank 4 is all-gathering -2142209024 parameters
Hi, I am trying to use autoresume to continue training my failed jobs, but I get the following error:
File "/opt/conda/lib/python3.9/site-packages/torch/distributed/fsdp/_exec_order_utils.py", line 243, in _check_order
RuntimeError: Forward order differs across ranks: rank 0 is all-gathering 1 parameters while rank 4 is all-gathering -2142209024 parameters
When I train a model on a single node, save a checkpoint, and set autoresume=True to continue training on a single node, it works.
However, when I train a model on 16 nodes, save a checkpoint, and then autoresume on either 1 or 16 nodes, I get the error above.
I googled it, but only found this Stack Overflow question. It reports the same error, but has no answer yet.
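For reference, my setup is roughly the following. This is a simplified sketch, not my actual training script: the toy model, dataloader, duration, save folder, and FSDP settings are placeholders.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from composer import Trainer
from composer.models import ComposerClassifier

# Toy model/data just to make the sketch self-contained; the real run uses my own model.
model = ComposerClassifier(torch.nn.Sequential(torch.nn.Linear(8, 2)), num_classes=2)
train_dataloader = DataLoader(
    TensorDataset(torch.randn(64, 8), torch.randint(0, 2, (64,))), batch_size=8
)

trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    max_duration='2ep',                               # placeholder duration
    run_name='my-run',                                # autoresume needs a fixed run_name
    save_folder='/path/to/checkpoints',               # placeholder; autoresume looks for the latest checkpoint here
    save_interval='1ep',
    autoresume=True,                                  # resume automatically from the latest checkpoint
    fsdp_config={'sharding_strategy': 'FULL_SHARD'},  # FSDP sharding; the error appears with the 16-node run
)
trainer.fit()
```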
Apologies for the delay! Are you able to specify the checkpoint you want to load using load_path instead of autoresume=True? Or do you hit the same error?
@Landanjs Yes, I am able to use load_path. However, the job gets stuck at the very beginning if I use load_path=/path/of/checkpoint and set load_weights_only=False.
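For context, this is roughly what the hanging configuration looks like (simplified sketch; `model` and `train_dataloader` are the same placeholders as in my earlier sketch, and the checkpoint path is a placeholder):

```python
from composer import Trainer

# Same setup as before, but resuming via an explicit checkpoint path instead of autoresume.
trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    max_duration='2ep',
    load_path='/path/of/checkpoint',                  # placeholder path to the saved checkpoint
    load_weights_only=False,                          # also restore optimizer/timestamp state; this is where it hangs
    fsdp_config={'sharding_strategy': 'FULL_SHARD'},
)
trainer.fit()
```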