
How does the leave_out parameter number the layers?

insomnia1996 opened this issue 2 years ago • 4 comments

Hey, you can use the leave_out parameter of the AdapterConfig to achieve this. The leave_out parameter specifies the layers in which no adapter module should be added. You can simply pass a list of the indices of all encoder layers, and no adapter modules will be added to them. I hope this helps.
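For illustration, a minimal sketch of that (the facebook/bart-large checkpoint, the PfeifferConfig bottleneck config, and the adapter name are only placeholders for your own setup):

from transformers import BartModel
from transformers.adapters import PfeifferConfig

model = BartModel.from_pretrained("facebook/bart-large")

# Leave out all 12 encoder layers (ids 0-11): no bottleneck adapter modules
# are inserted there, so only the decoder layers receive adapters.
config = PfeifferConfig(leave_out=list(range(12)))
model.add_adapter("decoder_only", config=config)
model.train_adapter("decoder_only")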

Hi, I'm confused about the documentation of leave_out: "The IDs of the layers (starting at 0)". Could you please explain how the indices of all the layers are ordered? Taking a 12-layer BART as an example, are the encoder layers numbered 0 to 11 and the decoder layers 12 to 23? Thanks in advance!

Originally posted by @insomnia1996 in https://github.com/Adapter-Hub/adapter-transformers/issues/264#issuecomment-1094306657

insomnia1996 · Apr 11 '22 01:04

Hi, yes you are right. For a BART model with 12 encoder layers and 12 decoder layers, the encoder layers would have ids 0 to 11 and the decoder layers would have ids 12 to 23.
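If you don't want to hard-code the ids, you can derive them from the model config; a small sketch (facebook/bart-large is just an example checkpoint with 12 encoder and 12 decoder layers):

from transformers import BartConfig

config = BartConfig.from_pretrained("facebook/bart-large")
encoder_ids = list(range(config.encoder_layers))  # [0, ..., 11]
decoder_ids = list(range(config.encoder_layers,
                         config.encoder_layers + config.decoder_layers))  # [12, ..., 23]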

hSterz · Apr 12 '22 09:04


Thanks!

insomnia1996 · Apr 12 '22 09:04

Sorry to bother you, but I still find that prefix tuning modules are added in the BART decoder even though I have already set leave_out:

from transformers.adapters import PrefixTuningConfig
# leave_out lists all 12 decoder layer ids (12-23), so only the encoder layers should get prefixes
ptuning_config = PrefixTuningConfig(cross_prefix=False, leave_out=[12,13,14,15,16,17,18,19,20,21,22,23])
model.add_adapter("prefixtuningconfig", config=ptuning_config)

The output is shown below:

BartForConditionalGeneration(
  (shared_parameters): ModuleDict()
  (model): BartModel(
    (shared_parameters): ModuleDict()
    (invertible_adapters): ModuleDict()
    (shared): Embedding(50265, 1024, padding_idx=1)
    (encoder): BartEncoder(
      (invertible_adapters): ModuleDict()
      (embed_tokens): Embedding(50265, 1024, padding_idx=1)
      (embed_positions): BartLearnedPositionalEmbedding(1026, 1024)
      (layers): ModuleList(
        (0): BartEncoderLayer(
          (self_attn): BartAttention(
            (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (prefix_tuning): PrefixTuningShim(
              (pool): PrefixTuningPool(
                (prefix_tunings): ModuleDict(
                  (mamconfig): PrefixTuningGroup(
                    (encoder_prefix): PrefixTuning(
                      (wte): Embedding(30, 1024)
                      (control_trans): Sequential(
                        (0): Linear(in_features=1024, out_features=512, bias=True)
                        (1): Activation_Function_Class(
                          (f): Tanh()
                        )
                        (2): Linear(in_features=512, out_features=24576, bias=True)
                      )
                      (dropout): Dropout(p=0.0, inplace=False)
                    )
                  )
                )
              )
            )
          )
...

    (decoder): BartDecoder(
      (embed_tokens): Embedding(50265, 1024, padding_idx=1)
      (embed_positions): BartLearnedPositionalEmbedding(1026, 1024)
      (layers): ModuleList(
        (0): BartDecoderLayer(
          (self_attn): BartAttention(
            (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (prefix_tuning): PrefixTuningShim(
              (pool): PrefixTuningPool(
                (prefix_tunings): ModuleDict(
                  (mamconfig): PrefixTuningGroup(
                    (encoder_prefix): PrefixTuning(
                      (wte): Embedding(30, 1024)
                      (control_trans): Sequential(
                        (0): Linear(in_features=1024, out_features=512, bias=True)
                        (1): Activation_Function_Class(
                          (f): Tanh()
                        )
                        (2): Linear(in_features=512, out_features=24576, bias=True)
                      )
                      (dropout): Dropout(p=0.0, inplace=False)
                    )
                  )
                )
              )
            )
          )
          (activation_fn): GELUActivation()
          (self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (encoder_attn): BartAttention(
            (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (prefix_tuning): PrefixTuningShim(
              (pool): PrefixTuningPool(
                (prefix_tunings): ModuleDict(
                  (mamconfig): PrefixTuningGroup(
                    (encoder_prefix): PrefixTuning(
                      (wte): Embedding(30, 1024)
                      (control_trans): Sequential(
                        (0): Linear(in_features=1024, out_features=512, bias=True)
                        (1): Activation_Function_Class(
                          (f): Tanh()
                        )
                        (2): Linear(in_features=512, out_features=24576, bias=True)
                      )
                      (dropout): Dropout(p=0.0, inplace=False)
                    )
                  )
                )
              )
            )
          )
...

Am I correctly using the AdapterConfig? Thanks in advance!

insomnia1996 · May 27 '22 04:05

Hey, regarding prefix tuning: the PrefixTuningPool is shared by all model layers, so its submodules show up under every layer in the printed model output. This does not mean that prefixes are actually added to each layer.
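If you want to double-check this on your side, one rough way (just a sketch; facebook/bart-large and the adapter name are placeholders) is to compare the number of trainable parameters with and without leave_out. The printed module tree looks the same in both cases because of the shared pool, but the parameter count should reflect which layers actually receive prefixes:

from transformers import BartForConditionalGeneration
from transformers.adapters import PrefixTuningConfig

def prefix_trainable_params(leave_out):
    model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")
    config = PrefixTuningConfig(cross_prefix=False, leave_out=leave_out)
    model.add_adapter("ptuning", config=config)
    model.train_adapter("ptuning")  # freezes everything except the prefix-tuning parameters
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Leaving out the 12 decoder layers should give a smaller count than leaving out nothing,
# even though the PrefixTuningPool submodules are printed under every attention block.
print(prefix_trainable_params(leave_out=[]))
print(prefix_trainable_params(leave_out=list(range(12, 24))))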

calpt · Jul 26 '22 17:07

@calpt @hSterz, sorry to bother you here. Is this the same situation for adapters? I am trying to leave_out some layers, but they still appear to have adapters, and the number of trainable parameters remains the same.

ajesujoba · May 19 '23 15:05

Hey @ajesujoba. For bottleneck adapters, this should not be the case. If leave_out is set, no adapter modules should be added to the respective layers, and the number of trainable parameters should change.
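A rough way to check which layers actually carry adapter weights (just a sketch, assuming the parameter paths contain both the layer index and the adapter name, as in the module tree printed above; PfeifferConfig, the adapter name and the checkpoint are placeholders):

import re

from transformers import BartForConditionalGeneration
from transformers.adapters import PfeifferConfig

model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")
model.add_adapter("bottleneck", config=PfeifferConfig(leave_out=list(range(12, 24))))

# Collect the (component, layer index) pairs whose parameters belong to the "bottleneck"
# adapter; bottleneck adapter weights live under an "adapters" ModuleDict keyed by adapter name.
layers_with_adapter = set()
for name, _ in model.named_parameters():
    if ".adapters.bottleneck." in name:
        match = re.search(r"(encoder|decoder)\.layers\.(\d+)\.", name)
        if match:
            layers_with_adapter.add((match.group(1), int(match.group(2))))

print(sorted(layers_with_adapter))  # with the leave_out above, only encoder layers should appear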

calpt · May 21 '23 16:05