gpt-neox
Checkpoint merge script
This PR removes the old merge script and adds a new one.
This assumes the config file management PR (https://github.com/EleutherAI/gpt-neox/pull/463) has been merged, so that config files live in the global_step* directory.
Note that I have only tested this with the config parameters I commonly use. There may be cases where the merge script does not apply because the partition dimensions differ from the ones it expects.
Ok @sweinbach, I did some pretty extensive testing / debugging of this, and had to change a fair few things to get it to work at all, but I think this is about as good as we're going to get it, for reasons I'll explain further down in this comment (unless I'm missing some bug / error in my implementation).
Stuff I changed:
- Made the function accessible without the argparse args, so we can use it externally if need be.
- Got rid of the double loop (looping over mp first, then grouped weights). This sped things up a little for a larger model and produced the same outputs (verified with `tools/inspect_checkpoints.py` before and after).
- Replaced the load/save directories with the output directory - otherwise, upon reloading, neox will still try to load the weights from the parallelized model.
- Ensured the vocab size in the merged model was correct by modifying the `make_vocab_size_divisible_by` arg here.
- Told megatron not to load RNG states in the resulting model, as they will be incorrect.
- Removed `data_parallel_world_size` from the ignored layers - deepspeed actually raised an error for me if this wasn't in the checkpoint weights, because of this line.
- Different strategy for merging - for some reason, the replicated (non model parallel) parameters are not all equal across checkpoints. Because of this, we can't rely on `elif all_equal(partitions):` to catch all the replicated parameters. Instead, I take a rule-based approach: all parameters except the vocab parallel embedding, row parallel linear weights, column parallel linear weights, and column parallel linear biases are replicated (you can see how NVIDIA megatron does this for reference), and they can be merged either by just taking the 0th partition or by taking the average of all partitions (see the sketch after this list).
- For efficiency, only the first of the `model_states` checkpoints is loaded. This was by far the most time-consuming operation (~5 minutes, or about half the runtime for my model), and we really only need the first one, because we're throwing out the optimizer states and everything else is replicated.
- I also write out the `latest` global_step file here.
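For reference, a minimal sketch of that rule-based merge, assuming `rank_state_dicts` is a list of model state dicts (one per model-parallel rank) with identical keys. The key patterns, helper names, and the `average_replicated` flag are illustrative assumptions, not the exact code in this PR; partition dimensions follow the usual Megatron conventions (vocab-parallel embeddings and column-parallel weights/biases split along dim 0, row-parallel weights along dim 1).

```python
import torch

# Sketch only: key patterns and helper names are assumptions, not the PR's exact code.
def merge_param(name, tensors, average_replicated=False):
    """Merge one parameter's partitions gathered from all model-parallel ranks."""
    if "word_embeddings.weight" in name:
        # VocabParallelEmbedding: partitioned along the vocab dimension.
        return torch.cat(tensors, dim=0)
    if "attention.dense.weight" in name or "dense_4h_to_h.weight" in name:
        # RowParallelLinear weights: partitioned along the input dimension.
        return torch.cat(tensors, dim=1)
    if "query_key_value" in name or "dense_h_to_4h" in name:
        # ColumnParallelLinear weights and biases: partitioned along the output dimension.
        return torch.cat(tensors, dim=0)
    # Everything else (layernorms, row-parallel biases, ...) is replicated:
    # either keep rank 0's copy or average across ranks.
    if average_replicated:
        return torch.stack(tensors).float().mean(dim=0).to(tensors[0].dtype)
    return tensors[0]

def merge_state_dicts(rank_state_dicts, average_replicated=False):
    """Collapse a list of per-rank state dicts into a single mp=1 state dict."""
    return {
        name: merge_param(name, [sd[name] for sd in rank_state_dicts], average_replicated)
        for name in rank_state_dicts[0]
    }
```

Taking the 0th partition is the cheapest option; averaging is what produced the slightly better lambada numbers below.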
Results:
So, the results aren't great. Since, as I mentioned, the replicated parameters are not actually identical across ranks, the merge does result in a loss of accuracy. It's not small, but it's not model-breaking, and I suspect the accuracy could probably be recovered with a bit of extra tuning.
These are the eval results on lambada for the base model:
```json
{
  "lambada": {
    "ppl": 4.843396407429969,
    "acc": 0.6955171744614788,
  }
}
```
and for the merged model (zeroth partition):
```json
{
  "lambada": {
    "ppl": 5.400162748158698,
    "acc": 0.6751406947409276,
  }
}
```
and for the merged model with averaged parameters:
```json
{
  "lambada": {
    "ppl": 5.391328941501356,
    "acc": 0.6794100523966622,
  }
}
```
Differences:
The differences between the replicated parameters are mostly very small. For example, the elementwise difference between the `input_layernorm.weight` tensors of two model parallel ranks of a 20B model is mostly zeros, with very slight deviations. Other layers are similar:
```python
diff('input_layernorm.weight').sum()
# tensor(-0.0024, dtype=torch.float16)
diff('attention.dense.bias').sum()
# tensor(-0.0009, dtype=torch.float16)
diff('post_attention_layernorm.weight').sum()
# tensor(-0.0088, dtype=torch.float16)
```
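For context, a minimal sketch of how a `diff` helper like the one above can be implemented, assuming the non-pipeline DeepSpeed layout where the full module state dict sits under the `"module"` key of each `mp_rank_XX_model_states.pt` file (with pipeline parallelism the layer weights live in `layer_*` files instead); the path and layer prefix are placeholders:

```python
import torch

ckpt_dir = "checkpoints/global_step_xxx"  # placeholder checkpoint directory

# Load the model states of two model-parallel ranks onto CPU.
sd0 = torch.load(f"{ckpt_dir}/mp_rank_00_model_states.pt", map_location="cpu")["module"]
sd1 = torch.load(f"{ckpt_dir}/mp_rank_01_model_states.pt", map_location="cpu")["module"]

def diff(suffix, prefix="2."):  # "2." is a placeholder layer prefix
    """Elementwise difference of a nominally replicated parameter across two ranks."""
    return sd0[prefix + suffix] - sd1[prefix + suffix]

print(diff("input_layernorm.weight").sum())  # small but nonzero, as shown above
```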
I think it's worth talking to the megatron and/or deepspeed devs to see if they've had a similar problem. If the divergence between replicated parameters is something fixable, we can integrate the fix into neox. In theory, data parallelism should result in all these params being equal, but maybe there's something I'm missing...
Anyway, the script works and I think this can be merged, but maybe we should log a warning that merging will cost some accuracy.
Additionally, there are certain settings I know this won't work for at all (geglu, for example), so we should also throw an error for known bad settings.
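A minimal sketch of that kind of guard, assuming a hypothetical `activation` attribute on the parsed config (the real NeoX field name may differ):

```python
# Illustrative guard only: the config attribute name is an assumption.
KNOWN_BAD_ACTIVATIONS = {"geglu"}

def check_mergeable(neox_args):
    """Refuse to merge configurations whose partitioning the merge rules don't cover."""
    activation = getattr(neox_args, "activation", None)
    if activation in KNOWN_BAD_ACTIVATIONS:
        raise ValueError(
            f"Checkpoint merging is not supported with activation={activation!r}; "
            "its parallel partitioning differs from the layout assumed here."
        )
```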
Thank you @sdtblck. All points taken and valid. I can confirm the observation that the all_equal assumption does not hold for larger models; for the smaller models I checked, this does not seem to be an issue.
Thanks for the effort!
@sweinbach anything else to add, or would you say this is ready to merge?
Closing as superseded by other approaches.