Add Correct Attention Dense Gemm (AutoTP) for FunnelTransformer and TransformerXL
Motivation: This PR adds the correct attention output/dense Gemm to AutoTP for the following two models (an illustrative usage sketch follows this list):
- FunnelTransformer: post_proj
- TransformerXL: o_net
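For context, a minimal sketch of how such an output-projection Gemm name is normally consumed when automatic detection is not in play: it can be supplied explicitly through the injection_policy argument of deepspeed.init_inference. The checkpoint name, tensor-parallel degree, and exact keyword arguments below are assumptions for illustration only, and whether this path actually works for these two models is precisely what the discussion further down is about.

import torch
import deepspeed
from transformers import AutoModel
from transformers.models.funnel.modeling_funnel import FunnelLayer

# Load the model on every rank (launch with e.g. `deepspeed --num_gpus 2 script.py`).
model = AutoModel.from_pretrained("funnel-transformer/small")  # checkpoint name assumed

# Explicit policy: shard each FunnelLayer and all-reduce after the attention
# output/dense Gemm (post_proj). `mp_size` is the tensor-parallel degree; newer
# DeepSpeed releases accept `tensor_parallel={"tp_size": 2}` instead.
model = deepspeed.init_inference(
    model,
    mp_size=2,
    dtype=torch.float16,
    injection_policy={FunnelLayer: ("attention.post_proj",)},
)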
The model structure for FunnelTransformer:
FunnelModel(
  (embeddings): FunnelEmbeddings(
    (word_embeddings): Embedding(30522, 768)
    (layer_norm): LayerNorm((768,), eps=1e-09, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): FunnelEncoder(
    (attention_structure): FunnelAttentionStructure(
      (sin_dropout): Dropout(p=0.1, inplace=False)
      (cos_dropout): Dropout(p=0.1, inplace=False)
    )
    (blocks): ModuleList(
      (0-2): 3 x ModuleList(
        (0-3): 4 x FunnelLayer(
          (attention): FunnelRelMultiheadAttention(
            (hidden_dropout): Dropout(p=0.1, inplace=False)
            (attention_dropout): Dropout(p=0.1, inplace=False)
            (q_head): Linear(in_features=768, out_features=768, bias=False)
            (k_head): Linear(in_features=768, out_features=768, bias=True)
            (v_head): Linear(in_features=768, out_features=768, bias=True)
            (post_proj): Linear(in_features=768, out_features=768, bias=True)
            (layer_norm): LayerNorm((768,), eps=1e-09, elementwise_affine=True)
          )
          (ffn): FunnelPositionwiseFFN(
            (linear_1): Linear(in_features=768, out_features=3072, bias=True)
            (activation_function): NewGELUActivation()
            (activation_dropout): Dropout(p=0.0, inplace=False)
            (linear_2): Linear(in_features=3072, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
            (layer_norm): LayerNorm((768,), eps=1e-09, elementwise_affine=True)
          )
        )
      )
    )
  )
  (decoder): FunnelDecoder(
    (attention_structure): FunnelAttentionStructure(
      (sin_dropout): Dropout(p=0.1, inplace=False)
      (cos_dropout): Dropout(p=0.1, inplace=False)
    )
    (layers): ModuleList(
      (0-1): 2 x FunnelLayer(
        (attention): FunnelRelMultiheadAttention(
          (hidden_dropout): Dropout(p=0.1, inplace=False)
          (attention_dropout): Dropout(p=0.1, inplace=False)
          (q_head): Linear(in_features=768, out_features=768, bias=False)
          (k_head): Linear(in_features=768, out_features=768, bias=True)
          (v_head): Linear(in_features=768, out_features=768, bias=True)
          (post_proj): Linear(in_features=768, out_features=768, bias=True)
          (layer_norm): LayerNorm((768,), eps=1e-09, elementwise_affine=True)
        )
        (ffn): FunnelPositionwiseFFN(
          (linear_1): Linear(in_features=768, out_features=3072, bias=True)
          (activation_function): NewGELUActivation()
          (activation_dropout): Dropout(p=0.0, inplace=False)
          (linear_2): Linear(in_features=3072, out_features=768, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
          (layer_norm): LayerNorm((768,), eps=1e-09, elementwise_affine=True)
        )
      )
    )
  )
)
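The printout above can be reproduced with a plain transformers load; the checkpoint name below is an assumption (any FunnelTransformer checkpoint shows the same layer layout):

from transformers import AutoModel

# Print the module tree; each FunnelLayer exposes q_head/k_head/v_head and the
# post_proj Linear, which is the attention output/dense Gemm referenced above.
model = AutoModel.from_pretrained("funnel-transformer/small")
print(model)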
The model structure for TransformerXL:
TransfoXLModel(
  (word_emb): AdaptiveEmbedding(
    (emb_layers): ModuleList(
      (0): Embedding(20000, 1024)
      (1): Embedding(20000, 256)
      (2): Embedding(160000, 64)
      (3): Embedding(67735, 16)
    )
    (emb_projs): ParameterList(
      (0): Parameter containing: [torch.float32 of size 1024x1024]
      (1): Parameter containing: [torch.float32 of size 1024x256]
      (2): Parameter containing: [torch.float32 of size 1024x64]
      (3): Parameter containing: [torch.float32 of size 1024x16]
    )
  )
  (drop): Dropout(p=0.1, inplace=False)
  (layers): ModuleList(
    (0-17): 18 x RelPartialLearnableDecoderLayer(
      (dec_attn): RelPartialLearnableMultiHeadAttn(
        (qkv_net): Linear(in_features=1024, out_features=3072, bias=False)
        (drop): Dropout(p=0.1, inplace=False)
        (dropatt): Dropout(p=0.0, inplace=False)
        (o_net): Linear(in_features=1024, out_features=1024, bias=False)
        (layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (r_net): Linear(in_features=1024, out_features=1024, bias=False)
      )
      (pos_ff): PositionwiseFF(
        (CoreNet): Sequential(
          (0): Linear(in_features=1024, out_features=4096, bias=True)
          (1): ReLU(inplace=True)
          (2): Dropout(p=0.1, inplace=False)
          (3): Linear(in_features=4096, out_features=1024, bias=True)
          (4): Dropout(p=0.1, inplace=False)
        )
        (layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
      )
    )
  )
  (pos_emb): PositionalEmbedding()
)
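The analogous sketch for TransformerXL, where o_net is the attention output Gemm whose result needs the tensor-parallel all-reduce. The checkpoint name and import path are assumptions (the transfo_xl modeling code has moved between transformers releases), and again this is illustration only, not the change made by this PR.

import torch
import deepspeed
from transformers import AutoModel
from transformers.models.transfo_xl.modeling_transfo_xl import RelPartialLearnableDecoderLayer

model = AutoModel.from_pretrained("transfo-xl-wt103")  # checkpoint name assumed

# Shard each decoder layer and all-reduce after the o_net Gemm inside dec_attn.
# As above, the `mp_size`/`injection_policy` kwargs may differ across DeepSpeed
# versions; treat this as a sketch of the explicit-policy route.
model = deepspeed.init_inference(
    model,
    mp_size=2,
    dtype=torch.float32,  # the dump above shows float32 embedding projections
    injection_policy={RelPartialLearnableDecoderLayer: ("dec_attn.o_net",)},
)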
@molly-smith, @RezaYazdaniAminabadi, requesting review. Thanks!
Hi @abhilash1910, part of these changes should not be necessary. These Gemms are already included in the AutoTP-generated policies for these models, so you should be able to run them without modifying AutoTP. If you have any issues running these models, please submit an issue or feature request.
Thanks @molly-smith for the review. I am running some tests on these models in the meantime; I see that the Gemm is being detected by the built-in policy, and I will close this PR if the tests pass.
I tested these models on my side. They are not supported and will require more involved changes. Closing this PR.
Hi @molly-smith, is this now supported? Could you let me know?