
How does the leave_out parameter number the layers?

insomnia1996 opened this issue 2 years ago • 4 comments

Hey, you can use the leave_out parameter of the AdapterConfig to achieve this. The leave_out parameter specifies the layers in which no adapter module should be added. You can simply pass a list of the indices of all encoder layers, and no adapter modules will be added to them. I hope this helps.
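For illustration, a minimal sketch of that (the facebook/bart-large checkpoint, the PfeifferConfig bottleneck config, and the adapter name are only placeholders for your own setup):

from transformers import BartModel
from transformers.adapters import PfeifferConfig

model = BartModel.from_pretrained("facebook/bart-large")

# Leave out all 12 encoder layers (ids 0-11): no bottleneck adapter modules
# are inserted there, so only the decoder layers receive adapters.
config = PfeifferConfig(leave_out=list(range(12)))
model.add_adapter("decoder_only", config=config)
model.train_adapter("decoder_only")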

Hi, I'm confused about the documentation of leave_out: "The IDs of the layers (starting at 0)". Could you please explain how the indices of all the layers are ordered? Taking a 12-layer BART as an example, are the encoder layers numbered 0 to 11 and the decoder layers 12 to 23? Thanks in advance!

Originally posted by @insomnia1996 in https://github.com/Adapter-Hub/adapter-transformers/issues/264#issuecomment-1094306657

insomnia1996 · Apr 11 '22 01:04

Hi, yes you are right. For a BART model with 12 encoder layers and 12 decoder layers, the encoder layers would have ids 0 to 11 and the decoder layers would have ids 12 to 23.
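If you don't want to hard-code the ids, you can derive them from the model config; a small sketch (facebook/bart-large is just an example checkpoint with 12 encoder and 12 decoder layers):

from transformers import BartConfig

config = BartConfig.from_pretrained("facebook/bart-large")
encoder_ids = list(range(config.encoder_layers))  # [0, ..., 11]
decoder_ids = list(range(config.encoder_layers,
                         config.encoder_layers + config.decoder_layers))  # [12, ..., 23]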

hSterz · Apr 12 '22 09:04


Thanks!

insomnia1996 · Apr 12 '22 09:04

Sorry to bother you, but I still find that prefix tuning modules are added in the BART decoder even though I have already set leave_out:

from transformers.adapters import PrefixTuningConfig
# leave_out lists all 12 decoder layer ids (12-23), so only the encoder layers should get prefixes
ptuning_config = PrefixTuningConfig(cross_prefix=False, leave_out=[12,13,14,15,16,17,18,19,20,21,22,23])
model.add_adapter("prefixtuningconfig", config=ptuning_config)

The output is shown below:

BartForConditionalGeneration(
  (shared_parameters): ModuleDict()
  (model): BartModel(
    (shared_parameters): ModuleDict()
    (invertible_adapters): ModuleDict()
    (shared): Embedding(50265, 1024, padding_idx=1)
    (encoder): BartEncoder(
      (invertible_adapters): ModuleDict()
      (embed_tokens): Embedding(50265, 1024, padding_idx=1)
      (embed_positions): BartLearnedPositionalEmbedding(1026, 1024)
      (layers): ModuleList(
        (0): BartEncoderLayer(
          (self_attn): BartAttention(
            (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (prefix_tuning): PrefixTuningShim(
              (pool): PrefixTuningPool(
                (prefix_tunings): ModuleDict(
                  (mamconfig): PrefixTuningGroup(
                    (encoder_prefix): PrefixTuning(
                      (wte): Embedding(30, 1024)
                      (control_trans): Sequential(
                        (0): Linear(in_features=1024, out_features=512, bias=True)
                        (1): Activation_Function_Class(
                          (f): Tanh()
                        )
                        (2): Linear(in_features=512, out_features=24576, bias=True)
                      )
                      (dropout): Dropout(p=0.0, inplace=False)
                    )
                  )
                )
              )
            )
          )
...

    (decoder): BartDecoder(
      (embed_tokens): Embedding(50265, 1024, padding_idx=1)
      (embed_positions): BartLearnedPositionalEmbedding(1026, 1024)
      (layers): ModuleList(
        (0): BartDecoderLayer(
          (self_attn): BartAttention(
            (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (prefix_tuning): PrefixTuningShim(
              (pool): PrefixTuningPool(
                (prefix_tunings): ModuleDict(
                  (mamconfig): PrefixTuningGroup(
                    (encoder_prefix): PrefixTuning(
                      (wte): Embedding(30, 1024)
                      (control_trans): Sequential(
                        (0): Linear(in_features=1024, out_features=512, bias=True)
                        (1): Activation_Function_Class(
                          (f): Tanh()
                        )
                        (2): Linear(in_features=512, out_features=24576, bias=True)
                      )
                      (dropout): Dropout(p=0.0, inplace=False)
                    )
                  )
                )
              )
            )
          )
          (activation_fn): GELUActivation()
          (self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (encoder_attn): BartAttention(
            (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (prefix_tuning): PrefixTuningShim(
              (pool): PrefixTuningPool(
                (prefix_tunings): ModuleDict(
                  (mamconfig): PrefixTuningGroup(
                    (encoder_prefix): PrefixTuning(
                      (wte): Embedding(30, 1024)
                      (control_trans): Sequential(
                        (0): Linear(in_features=1024, out_features=512, bias=True)
                        (1): Activation_Function_Class(
                          (f): Tanh()
                        )
                        (2): Linear(in_features=512, out_features=24576, bias=True)
                      )
                      (dropout): Dropout(p=0.0, inplace=False)
                    )
                  )
                )
              )
            )
          )
...

Am I correctly using the AdapterConfig? Thanks in advance!

insomnia1996 · May 27 '22 04:05

Hey, regarding prefix tuning: the PrefixTuningPool is shared by all model layers, so its submodules show up under every layer in the printed model output. This does not mean that prefixes are actually added to each layer.
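If you want to double-check this on your side, one rough way (just a sketch; facebook/bart-large and the adapter name are placeholders) is to compare the number of trainable parameters with and without leave_out. The printed module tree looks the same in both cases because of the shared pool, but the parameter count should reflect which layers actually receive prefixes:

from transformers import BartForConditionalGeneration
from transformers.adapters import PrefixTuningConfig

def prefix_trainable_params(leave_out):
    model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")
    config = PrefixTuningConfig(cross_prefix=False, leave_out=leave_out)
    model.add_adapter("ptuning", config=config)
    model.train_adapter("ptuning")  # freezes everything except the prefix-tuning parameters
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Leaving out the 12 decoder layers should give a smaller count than leaving out nothing,
# even though the PrefixTuningPool submodules are printed under every attention block.
print(prefix_trainable_params(leave_out=[]))
print(prefix_trainable_params(leave_out=list(range(12, 24))))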

calpt · Jul 26 '22 17:07

@calpt @hSterz, sorry to bother you here. Is this the same situation for adapters? I am trying to leave_out some layers, but they still appear to have adapters, and the number of trainable parameters remains the same.

ajesujoba · May 19 '23 15:05

Hey @ajesujoba. For bottleneck adapters, this should not be the case. If leave_out is set, no adapter modules should be added to the respective layers, and the number of trainable parameters should change.
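A rough way to check which layers actually carry adapter weights (just a sketch, assuming the parameter paths contain both the layer index and the adapter name, as in the module tree printed above; PfeifferConfig, the adapter name and the checkpoint are placeholders):

import re

from transformers import BartForConditionalGeneration
from transformers.adapters import PfeifferConfig

model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")
model.add_adapter("bottleneck", config=PfeifferConfig(leave_out=list(range(12, 24))))

# Collect the (component, layer index) pairs whose parameters belong to the "bottleneck"
# adapter; bottleneck adapter weights live under an "adapters" ModuleDict keyed by adapter name.
layers_with_adapter = set()
for name, _ in model.named_parameters():
    if ".adapters.bottleneck." in name:
        match = re.search(r"(encoder|decoder)\.layers\.(\d+)\.", name)
        if match:
            layers_with_adapter.add((match.group(1), int(match.group(2))))

print(sorted(layers_with_adapter))  # with the leave_out above, only encoder layers should appear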

calpt · May 21 '23 16:05