Megatron-LM
Adjusting the attention keys in the checkpoint
I'm trying to convert a GPT checkpoint from the local spec to the transformer_engine spec using the following key map:
```python
{
    'input_layernorm.': 'self_attention.linear_qkv.layer_norm_',
    'pre_mlp_layernorm.': 'mlp.linear_fc1.layer_norm_',
}
```
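As a minimal sketch (not Megatron-LM code, and assuming the weights are available as a plain name-to-tensor state dict), applying this rename looks like:

```python
# Sketch of renaming local-spec layernorm keys to their transformer_engine
# equivalents. The function name and state-dict layout are assumptions for
# illustration, not Megatron-LM APIs.
LOCAL_TO_TE_KEY_MAP = {
    'input_layernorm.': 'self_attention.linear_qkv.layer_norm_',
    'pre_mlp_layernorm.': 'mlp.linear_fc1.layer_norm_',
}

def convert_local_keys_to_te(state_dict):
    """Return a new state dict with local-spec keys renamed to the TE spec."""
    converted = {}
    for key, tensor in state_dict.items():
        new_key = key
        for local_prefix, te_prefix in LOCAL_TO_TE_KEY_MAP.items():
            new_key = new_key.replace(local_prefix, te_prefix)
        converted[new_key] = tensor
    return converted
```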
This conversion works only when the optimizer state is not loaded, because:
In both the local and the transformer_engine checkpoints, linear_proj is registered before linear_qkv due to the module initialization order, but the layernorm parameters sit at a different position in each ordering:
- local
  - layernorm -> linear_proj -> linear_qkv
- transformer_engine
  - linear_proj -> layernorm -> linear_qkv
For the optimizer, the sequentially stored weights therefore have to be swapped into the new order (sketched below). Alternatively, as this PR does, linear_qkv can be built before linear_proj, so that initialization matches the forward order:
- layernorm -> linear_qkv -> linear_proj
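For the first approach, here is a minimal sketch of re-sequencing one layer's attention parameters from the local order to the transformer_engine order, assuming they are available as an ordered list of (name, tensor) pairs; the prefixes and helper are illustrative, not Megatron-LM code:

```python
# Target per-layer ordering under the transformer_engine spec (assumption):
# linear_proj first, then the layernorm fused into linear_qkv, then linear_qkv.
TE_ORDER = [
    'self_attention.linear_proj',
    'self_attention.linear_qkv.layer_norm',
    'self_attention.linear_qkv',
]

def reorder_to_te(named_params, target_order=TE_ORDER):
    """Reorder (name, tensor) pairs so parameter groups follow target_order."""
    def group_index(name):
        for i, prefix in enumerate(target_order):
            if prefix in name:
                return i
        return len(target_order)  # unmatched params keep their relative place at the end

    # sorted() is stable, so params inside each group keep their original order.
    return sorted(named_params, key=lambda item: group_index(item[0]))
```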
@sudhakarsingh27 Please correct me if I'm wrong.
Marking as stale. No activity in 60 days.