Megatron-LM
Adjusting the attention keys in the checkpoint
I'm trying to convert a GPT checkpoint from the local spec to the transformer_engine spec using the following key map:
```python
{
    'input_layernorm.': 'self_attention.linear_qkv.layer_norm_',
    'pre_mlp_layernorm.': 'mlp.linear_fc1.layer_norm_',
}
```
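As a minimal sketch (not Megatron-LM code, and assuming the weights are available as a plain name-to-tensor state dict), applying this rename looks like:

```python
# Sketch of renaming local-spec layernorm keys to their transformer_engine
# equivalents. The function name and state-dict layout are assumptions for
# illustration, not Megatron-LM APIs.
LOCAL_TO_TE_KEY_MAP = {
    'input_layernorm.': 'self_attention.linear_qkv.layer_norm_',
    'pre_mlp_layernorm.': 'mlp.linear_fc1.layer_norm_',
}

def convert_local_keys_to_te(state_dict):
    """Return a new state dict with local-spec keys renamed to the TE spec."""
    converted = {}
    for key, tensor in state_dict.items():
        new_key = key
        for local_prefix, te_prefix in LOCAL_TO_TE_KEY_MAP.items():
            new_key = new_key.replace(local_prefix, te_prefix)
        converted[new_key] = tensor
    return converted
```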
This conversion works only when the optimizer state is not loaded, because:
In both the local and the transformer_engine checkpoints, linear_proj is registered before linear_qkv due to the module initialization order, but the layernorm parameters sit at a different position in each ordering:
- local
  - layernorm -> linear_proj -> linear_qkv
- transformer_engine
  - linear_proj -> layernorm -> linear_qkv
For the optimizer, the sequentially stored weights therefore have to be swapped into the new order (sketched below). Alternatively, as this PR does, linear_qkv can be built before linear_proj, so that initialization matches the forward order:
- layernorm -> linear_qkv -> linear_proj
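For the first approach, here is a minimal sketch of re-sequencing one layer's attention parameters from the local order to the transformer_engine order, assuming they are available as an ordered list of (name, tensor) pairs; the prefixes and helper are illustrative, not Megatron-LM code:

```python
# Target per-layer ordering under the transformer_engine spec (assumption):
# linear_proj first, then the layernorm fused into linear_qkv, then linear_qkv.
TE_ORDER = [
    'self_attention.linear_proj',
    'self_attention.linear_qkv.layer_norm',
    'self_attention.linear_qkv',
]

def reorder_to_te(named_params, target_order=TE_ORDER):
    """Reorder (name, tensor) pairs so parameter groups follow target_order."""
    def group_index(name):
        for i, prefix in enumerate(target_order):
            if prefix in name:
                return i
        return len(target_order)  # unmatched params keep their relative place at the end

    # sorted() is stable, so params inside each group keep their original order.
    return sorted(named_params, key=lambda item: group_index(item[0]))
```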
@sudhakarsingh27 Please correct me if I'm wrong.
Marking as stale. No activity in 60 days.