Forward order differs across ranks: rank 0 is all-gathering 1 parameters while rank 4 is all-gathering -2142209024 parameters
Hi, I am trying to use autoresume to continue training my failed jobs, but I get the following error:
File "/opt/conda/lib/python3.9/site-packages/torch/distributed/fsdp/_exec_order_utils.py", line 243, in _check_order
RuntimeError: Forward order differs across ranks: rank 0 is all-gathering 1 parameters while rank 4 is all-gathering -2142209024 parameters
When I train a model on a single node, save a checkpoint, and set autoresume=True to continue training on a single node, it works.
However, when I train a model on 16 nodes, save a checkpoint, and then autoresume on either 1 or 16 nodes, I get the error above.
I googled it, but only found this Stack Overflow question. It reports the same error, but has no answer yet.
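For reference, my setup is roughly the following. This is a simplified sketch, not my actual training script: the toy model, dataloader, duration, save folder, and FSDP settings are placeholders.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from composer import Trainer
from composer.models import ComposerClassifier

# Toy model/data just to make the sketch self-contained; the real run uses my own model.
model = ComposerClassifier(torch.nn.Sequential(torch.nn.Linear(8, 2)), num_classes=2)
train_dataloader = DataLoader(
    TensorDataset(torch.randn(64, 8), torch.randint(0, 2, (64,))), batch_size=8
)

trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    max_duration='2ep',                               # placeholder duration
    run_name='my-run',                                # autoresume needs a fixed run_name
    save_folder='/path/to/checkpoints',               # placeholder; autoresume looks for the latest checkpoint here
    save_interval='1ep',
    autoresume=True,                                  # resume automatically from the latest checkpoint
    fsdp_config={'sharding_strategy': 'FULL_SHARD'},  # FSDP sharding; the error appears with the 16-node run
)
trainer.fit()
```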
Apologies for the delay! Are you able to specify the checkpoint you want to load using load_path instead of autoresume=True? Or do you hit the same error?
@Landanjs Yes, I am able to use load_path. However, the job gets stuck at the very beginning if I use load_path=/path/of/checkpoint and set load_weights_only=False.
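For context, this is roughly what the hanging configuration looks like (simplified sketch; `model` and `train_dataloader` are the same placeholders as in my earlier sketch, and the checkpoint path is a placeholder):

```python
from composer import Trainer

# Same setup as before, but resuming via an explicit checkpoint path instead of autoresume.
trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    max_duration='2ep',
    load_path='/path/of/checkpoint',                  # placeholder path to the saved checkpoint
    load_weights_only=False,                          # also restore optimizer/timestamp state; this is where it hangs
    fsdp_config={'sharding_strategy': 'FULL_SHARD'},
)
trainer.fit()
```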