
Add Correct Attention Dense GEMM (AutoTP) for FunnelTransformer and TransformerXL

abhilash1910 opened this pull request 2 years ago • 4 comments

Motivation: This PR adds the correct attention output/dense GEMM mapping in AutoTP for two models -

  1. FunnelTransformer - post_proj

  2. TransformerXL - o_net

The configuration for FunnelTransformer:

FunnelModel(
  (embeddings): FunnelEmbeddings(
    (word_embeddings): Embedding(30522, 768)
    (layer_norm): LayerNorm((768,), eps=1e-09, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): FunnelEncoder(
    (attention_structure): FunnelAttentionStructure(
      (sin_dropout): Dropout(p=0.1, inplace=False)
      (cos_dropout): Dropout(p=0.1, inplace=False)
    )
    (blocks): ModuleList(
      (0-2): 3 x ModuleList(
        (0-3): 4 x FunnelLayer(
          (attention): FunnelRelMultiheadAttention(
            (hidden_dropout): Dropout(p=0.1, inplace=False)
            (attention_dropout): Dropout(p=0.1, inplace=False)
            (q_head): Linear(in_features=768, out_features=768, bias=False)
            (k_head): Linear(in_features=768, out_features=768, bias=True)
            (v_head): Linear(in_features=768, out_features=768, bias=True)
            (post_proj): Linear(in_features=768, out_features=768, bias=True)
            (layer_norm): LayerNorm((768,), eps=1e-09, elementwise_affine=True)
          )
          (ffn): FunnelPositionwiseFFN(
            (linear_1): Linear(in_features=768, out_features=3072, bias=True)
            (activation_function): NewGELUActivation()
            (activation_dropout): Dropout(p=0.0, inplace=False)
            (linear_2): Linear(in_features=3072, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
            (layer_norm): LayerNorm((768,), eps=1e-09, elementwise_affine=True)
          )
        )
      )
    )
  )
  (decoder): FunnelDecoder(
    (attention_structure): FunnelAttentionStructure(
      (sin_dropout): Dropout(p=0.1, inplace=False)
      (cos_dropout): Dropout(p=0.1, inplace=False)
    )
    (layers): ModuleList(
      (0-1): 2 x FunnelLayer(
        (attention): FunnelRelMultiheadAttention(
          (hidden_dropout): Dropout(p=0.1, inplace=False)
          (attention_dropout): Dropout(p=0.1, inplace=False)
          (q_head): Linear(in_features=768, out_features=768, bias=False)
          (k_head): Linear(in_features=768, out_features=768, bias=True)
          (v_head): Linear(in_features=768, out_features=768, bias=True)
          (post_proj): Linear(in_features=768, out_features=768, bias=True)
          (layer_norm): LayerNorm((768,), eps=1e-09, elementwise_affine=True)
        )
        (ffn): FunnelPositionwiseFFN(
          (linear_1): Linear(in_features=768, out_features=3072, bias=True)
          (activation_function): NewGELUActivation()
          (activation_dropout): Dropout(p=0.0, inplace=False)
          (linear_2): Linear(in_features=3072, out_features=768, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
          (layer_norm): LayerNorm((768,), eps=1e-09, elementwise_affine=True)
        )
      )
    )
  )
)

The configuration for TransformerXL:

TransfoXLModel(
  (word_emb): AdaptiveEmbedding(
    (emb_layers): ModuleList(
      (0): Embedding(20000, 1024)
      (1): Embedding(20000, 256)
      (2): Embedding(160000, 64)
      (3): Embedding(67735, 16)
    )
    (emb_projs): ParameterList(
        (0): Parameter containing: [torch.float32 of size 1024x1024]
        (1): Parameter containing: [torch.float32 of size 1024x256]
        (2): Parameter containing: [torch.float32 of size 1024x64]
        (3): Parameter containing: [torch.float32 of size 1024x16]
    )
  )
  (drop): Dropout(p=0.1, inplace=False)
  (layers): ModuleList(
    (0-17): 18 x RelPartialLearnableDecoderLayer(
      (dec_attn): RelPartialLearnableMultiHeadAttn(
        (qkv_net): Linear(in_features=1024, out_features=3072, bias=False)
        (drop): Dropout(p=0.1, inplace=False)
        (dropatt): Dropout(p=0.0, inplace=False)
        (o_net): Linear(in_features=1024, out_features=1024, bias=False)
        (layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (r_net): Linear(in_features=1024, out_features=1024, bias=False)
      )
      (pos_ff): PositionwiseFF(
        (CoreNet): Sequential(
          (0): Linear(in_features=1024, out_features=4096, bias=True)
          (1): ReLU(inplace=True)
          (2): Dropout(p=0.1, inplace=False)
          (3): Linear(in_features=4096, out_features=1024, bias=True)
          (4): Dropout(p=0.1, inplace=False)
        )
        (layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
      )
    )
  )
  (pos_emb): PositionalEmbedding()
)
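
For reference, below is a minimal sketch of how these two dense GEMMs could be handed to DeepSpeed explicitly through the injection_policy argument of deepspeed.init_inference, rather than through AutoTP's automatic detection. The tensor-parallel degree, dtype, checkpoint names, and the choice of layer classes as policy keys are assumptions made for illustration only.

# Minimal sketch (assumptions: 2 GPUs, fp16, funnel-transformer/small and
# transfo-xl-wt103 checkpoints): pass the attention dense GEMMs identified
# above (post_proj, o_net) to DeepSpeed by hand via injection_policy.
import torch
import deepspeed
from transformers import FunnelModel, TransfoXLModel
from transformers.models.funnel.modeling_funnel import FunnelLayer
from transformers.models.transfo_xl.modeling_transfo_xl import RelPartialLearnableDecoderLayer

# FunnelTransformer: the attention dense GEMM is attention.post_proj.
# In practice the FFN output GEMM (ffn.linear_2 in the dump above) would
# typically be listed here as well, since every linear whose output needs
# an all-reduce belongs in the policy.
funnel = FunnelModel.from_pretrained("funnel-transformer/small")
funnel = deepspeed.init_inference(
    funnel,
    mp_size=2,                      # tensor-parallel degree (assumed 2 GPUs)
    dtype=torch.float16,
    injection_policy={FunnelLayer: ("attention.post_proj",)},
)

# TransformerXL: the attention dense GEMM is dec_attn.o_net; the analogous
# FFN output GEMM is the 4096->1024 linear inside pos_ff.CoreNet.
xl = TransfoXLModel.from_pretrained("transfo-xl-wt103")
xl = deepspeed.init_inference(
    xl,
    mp_size=2,
    dtype=torch.float16,
    injection_policy={RelPartialLearnableDecoderLayer: ("dec_attn.o_net",)},
)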

abhilash1910 · Apr 27 '23 16:04

@molly-smith, @RezaYazdaniAminabadi, requesting review. Thanks.

abhilash1910 · Apr 27 '23 16:04

Hi @abhilash1910, part of these changes should not be necessary. These GEMMs are already included in the AutoTP-generated policies for these models, so you should be able to run these models without modifying AutoTP. If you have any issues running these models, please submit an issue or feature request.
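
A minimal sketch of the AutoTP path described here, assuming two GPUs, fp16, and the transfo-xl-wt103 checkpoint, launched with the DeepSpeed launcher (deepspeed --num_gpus 2 script.py):

# Minimal sketch (assumptions: 2 GPUs, fp16, transfo-xl-wt103 checkpoint):
# rely on AutoTP's generated policy, i.e. no injection_policy and no kernel injection.
import torch
import deepspeed
from transformers import TransfoXLModel

model = TransfoXLModel.from_pretrained("transfo-xl-wt103")
model = deepspeed.init_inference(
    model,
    mp_size=2,                          # tensor-parallel degree (assumed)
    dtype=torch.float16,
    replace_with_kernel_inject=False,   # AutoTP path: no custom kernels, no hand-written policy
)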

molly-smith · May 03 '23 18:05

Thanks @molly-smith for the review. I am running some tests on them in the meantime. I see that it is getting detected in the built-in policy; I will close this if the tests pass.

abhilash1910 · May 08 '23 07:05

I tested these models on my side. They are not supported and will require more involved changes. Closing this PR.

molly-smith · May 09 '23 20:05

Hi @molly-smith, is this now supported? Could you let me know?

abhilash1910 · Aug 09 '23 03:08