torchtitan
Starting off with different models across ranks and FSDP doesn't synchronise
https://github.com/pytorch/torchtrain/blob/8dd5798241490c5f532e822e9f9c1d30e0fba0df/train.py#L155-L159
Hey there! @mathuvu and I ran into a sneaky bug in our codebase that broke loss-evolution parity with xlformers.
Basically, we didn't seed torch before initialising the model, so each data-parallel rank got a different model. We were using PyTorch's FSDP1 and realised that it never syncs the model weights at startup, so we ended up optimising N different models with the same gradient. This gave us a much higher loss than xlformers and it also stagnated much earlier. But surprisingly, the model was still training and the loss still went down!
I quickly searched this codebase for torch manual seeding and didn't find anything, so I wanted to raise this issue in case it can spare someone some frustrating debugging!
After fixing this and a few differences in initialisation, we now match the loss of xlformers exactly, at least in the no_shard setting on small models when starting from the same init.
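For reference, here is a minimal sketch of the kind of fix we're describing (the constructor `build_model` and the seed value are placeholders, not our actual code): seed torch on every rank before building the model so all data-parallel ranks start from identical weights.

```python
import torch

def build_model() -> torch.nn.Module:
    # Hypothetical model constructor, standing in for the real one.
    return torch.nn.Linear(16, 16)

def init_model_identically(seed: int = 42) -> torch.nn.Module:
    # Seed the RNGs *before* building the model so every data-parallel
    # rank draws the same random weights. Without this, each rank builds
    # a different model and FSDP1 with NO_SHARD never reconciles them.
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    return build_model()
```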
I am curious to learn more about this for my understanding.
Consider one weight `w`. IIUC, we are comparing (1) randomly initialize one single `w` and have each of the `N` ranks take its corresponding slice of `w`, vs. (2) randomly initialize `N` weights `w_1, ..., w_N` and have the `i`th rank take its corresponding slice of `w_i`.
When we consider the gathered slices to reconstruct `w`, under what cases is (2) not still a valid sample from the same probability distribution used for (1)?
In the case of NO_SHARD, it's definitely an issue to start with different models across ranks, since there is no communication between ranks on the model itself: you just end up optimising different models with the same gradient.
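As a sanity check, something like the following sketch can catch this at startup. It assumes an initialised process group and fully replicated (NO_SHARD) parameters; the helper name is hypothetical.

```python
import torch
import torch.distributed as dist

def assert_replicas_match(model: torch.nn.Module) -> None:
    # Broadcast rank 0's copy of each parameter and compare it with the
    # local copy. With replicated (NO_SHARD) parameters, any mismatch
    # means the ranks started from different models.
    for name, param in model.named_parameters():
        reference = param.detach().clone()
        dist.broadcast(reference, src=0)
        if not torch.equal(param.detach(), reference):
            raise RuntimeError(
                f"rank {dist.get_rank()}: parameter {name!r} differs from rank 0"
            )
```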
Now, in the case of a sharded model, especially with meta init, this issue might indeed solve itself, since the only thing we get from a rank is its assigned shard, which it can initialise however it likes as long as it follows the same distribution. In that case, manual seeding might even break the training if it gives the same init to all shards of the model. In any case, it is not clear to me how the initialisation is done internally in FSDP, and we should have a procedure that 1) guarantees a correct IID init (no two shards initialised with the same values because ranks happen to use the same RNG entropy on the same node) and 2) allows for some amount of reproducibility. The closest analogy is in dataloaders, where we need to handle this explicitly, for example by seeding with seed + dp_rank to ensure we don't get the same augmentations or data-source choices on all dp_ranks; a sketch of that pattern is below.
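A minimal illustration of that seed + dp_rank pattern (names like `base_seed` and `build_dataloader` are placeholders): the dataloader RNG is offset per rank, while the model init keeps the shared seed on every rank.

```python
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader

def build_dataloader(dataset, base_seed: int, batch_size: int) -> DataLoader:
    # Offset the seed by the data-parallel rank so each rank sees
    # different shuffling/augmentations, while the model init still
    # uses the shared base_seed on every rank.
    dp_rank = dist.get_rank() if dist.is_initialized() else 0
    generator = torch.Generator()
    generator.manual_seed(base_seed + dp_rank)
    return DataLoader(dataset, batch_size=batch_size, shuffle=True, generator=generator)
```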
@BadrYoubiIdrissi This makes sense!
I am curious to learn more about your use case of NO_SHARD. Is it mainly that it is easy to switch between sharding strategies? Or is there any other reason for not using DDP in that case?
@BadrYoubiIdrissi is this an issue specific to FSDP1 + meta init, or does it happen with FSDP1 alone? I think FSDP1 has a `sync_module_states=True` flag that should give each rank the same replica before performing sharding?
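For illustration, a rough sketch of how that flag would be used (assuming an initialised process group, CUDA available, and a module already materialized on every rank rather than meta-init; the actual wrapping in this repo may differ):

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

model = torch.nn.Linear(16, 16)  # placeholder module, built locally on each rank

# sync_module_states=True broadcasts the module's parameters and buffers
# from rank 0 at wrap time, so per-rank initialisation differences are
# reconciled even when NO_SHARD does no further weight communication.
fsdp_model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.NO_SHARD,
    sync_module_states=True,
    device_id=torch.cuda.current_device(),
)
```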
I am going to close this issue for now since this is related to FSDP1 `NO_SHARD`, not FSDP2.