How does leave_out parameter rank all layers?
Hey, you can use the leave_out parameter of the AdapterConfig to achieve this. The leave_out parameter specifies the layers in which no adapter module should be added. You can simply pass a list of the indices of all encoder layers, and those layers will then not get an adapter module. I hope this helps.
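A minimal sketch of this suggestion (the checkpoint name, the transformers.adapters import path, PfeifferConfig as the bottleneck config, and the assumption that the 12 encoder layers carry ids 0-11 are illustrative, not from this thread):

from transformers import BartForConditionalGeneration
from transformers.adapters import PfeifferConfig  # bottleneck adapter config in adapter-transformers

model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

# Leave out the 12 encoder layers (ids 0-11), so adapter modules are only added to the decoder.
config = PfeifferConfig(leave_out=list(range(12)))
model.add_adapter("decoder_only_adapter", config=config)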
Hi, I'm confused about the documentation of leave_out: "The IDs of the layers (starting at 0)". Could you please explain how the layer indices are assigned? Taking a 12-layer BART as an example, are the encoder layers numbered 0 to 11 and the decoder layers 12 to 23? Thanks in advance!
Originally posted by @insomnia1996 in https://github.com/Adapter-Hub/adapter-transformers/issues/264#issuecomment-1094306657
Hi, yes you are right. For a BART model with 12 encoder layers and 12 decoder layers, the encoder layers would have ids 0 to 11 and the decoder layers would have ids 12 to 23.
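A minimal sketch of this id-to-layer mapping (the checkpoint name and variable names are illustrative, not from this thread):

from transformers import BartForConditionalGeneration

model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")
n_enc = len(model.model.encoder.layers)  # 12
n_dec = len(model.model.decoder.layers)  # 12

# leave_out ids run over the encoder layers first, then the decoder layers.
id_to_layer = {i: layer for i, layer in enumerate(model.model.encoder.layers)}
id_to_layer.update({n_enc + j: layer for j, layer in enumerate(model.model.decoder.layers)})

decoder_ids = list(range(n_enc, n_enc + n_dec))  # [12, ..., 23] leaves out every decoder layer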
Thanks!
Sorry to bother, but I still find that prefix tuning layers are added in the BART decoder even though I've already set leave_out with
ptuning_config = PrefixTuningConfig(cross_prefix=False, leave_out=[12,13,14,15,16,17,18,19,20,21,22,23])
model.add_adapter("prefixtuningconfig", config=ptuning_config)
The output is shown below:
BartForConditionalGeneration(
(shared_parameters): ModuleDict()
(model): BartModel(
(shared_parameters): ModuleDict()
(invertible_adapters): ModuleDict()
(shared): Embedding(50265, 1024, padding_idx=1)
(encoder): BartEncoder(
(invertible_adapters): ModuleDict()
(embed_tokens): Embedding(50265, 1024, padding_idx=1)
(embed_positions): BartLearnedPositionalEmbedding(1026, 1024)
(layers): ModuleList(
(0): BartEncoderLayer(
(self_attn): BartAttention(
(k_proj): Linear(in_features=1024, out_features=1024, bias=True)
(v_proj): Linear(in_features=1024, out_features=1024, bias=True)
(q_proj): Linear(in_features=1024, out_features=1024, bias=True)
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
(prefix_tuning): PrefixTuningShim(
(pool): PrefixTuningPool(
(prefix_tunings): ModuleDict(
(mamconfig): PrefixTuningGroup(
(encoder_prefix): PrefixTuning(
(wte): Embedding(30, 1024)
(control_trans): Sequential(
(0): Linear(in_features=1024, out_features=512, bias=True)
(1): Activation_Function_Class(
(f): Tanh()
)
(2): Linear(in_features=512, out_features=24576, bias=True)
)
(dropout): Dropout(p=0.0, inplace=False)
)
)
)
)
)
)
...
(decoder): BartDecoder(
(embed_tokens): Embedding(50265, 1024, padding_idx=1)
(embed_positions): BartLearnedPositionalEmbedding(1026, 1024)
(layers): ModuleList(
(0): BartDecoderLayer(
(self_attn): BartAttention(
(k_proj): Linear(in_features=1024, out_features=1024, bias=True)
(v_proj): Linear(in_features=1024, out_features=1024, bias=True)
(q_proj): Linear(in_features=1024, out_features=1024, bias=True)
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
(prefix_tuning): PrefixTuningShim(
(pool): PrefixTuningPool(
(prefix_tunings): ModuleDict(
(mamconfig): PrefixTuningGroup(
(encoder_prefix): PrefixTuning(
(wte): Embedding(30, 1024)
(control_trans): Sequential(
(0): Linear(in_features=1024, out_features=512, bias=True)
(1): Activation_Function_Class(
(f): Tanh()
)
(2): Linear(in_features=512, out_features=24576, bias=True)
)
(dropout): Dropout(p=0.0, inplace=False)
)
)
)
)
)
)
(activation_fn): GELUActivation()
(self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
(encoder_attn): BartAttention(
(k_proj): Linear(in_features=1024, out_features=1024, bias=True)
(v_proj): Linear(in_features=1024, out_features=1024, bias=True)
(q_proj): Linear(in_features=1024, out_features=1024, bias=True)
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
(prefix_tuning): PrefixTuningShim(
(pool): PrefixTuningPool(
(prefix_tunings): ModuleDict(
(mamconfig): PrefixTuningGroup(
(encoder_prefix): PrefixTuning(
(wte): Embedding(30, 1024)
(control_trans): Sequential(
(0): Linear(in_features=1024, out_features=512, bias=True)
(1): Activation_Function_Class(
(f): Tanh()
)
(2): Linear(in_features=512, out_features=24576, bias=True)
)
(dropout): Dropout(p=0.0, inplace=False)
)
)
)
)
)
)
...
Am I correctly using the AdapterConfig? Thanks in advance!
Hey, regarding prefix tuning: the PrefixTuningPool is shared by all model layers, so its submodules appear in every layer. This does not mean prefixes are added to each layer.
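One way to see this (a sketch, reusing the model and the prefix tuning adapter from the snippet above; the attribute path follows the printed module tree):

# The pool printed under every layer should be one and the same shared module.
enc_pool = model.model.encoder.layers[0].self_attn.prefix_tuning.pool
dec_pool = model.model.decoder.layers[0].self_attn.prefix_tuning.pool
print(enc_pool is dec_pool)  # expected to print True if the pool is a single shared instance, as described above

# Whether a prefix is actually applied in a given layer is governed by leave_out,
# not by the presence of the pool in the printed module tree.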
@calpt @hSterz, sorry to bother you here. Is this the same situation for adapters? I'm trying to leave_out some layers, but they still appear to have adapters, and the number of trainable parameters remains the same.
Hey @ajesujoba. For bottleneck adapters, this should not be the case: if leave_out is set, no adapter modules are added to the respective layers and the number of trainable parameters should change.
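A quick way to check this (a sketch; the checkpoint name, PfeifferConfig as the bottleneck config, and the num_params helper are assumptions, not from this thread), comparing how many parameters each adapter adds:

from transformers import BartForConditionalGeneration
from transformers.adapters import PfeifferConfig

def num_params(m):
    return sum(p.numel() for p in m.parameters())

model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")
base = num_params(model)

model.add_adapter("full", config=PfeifferConfig())
with_full = num_params(model) - base  # adapter modules in all 24 layers

model.add_adapter("enc_only", config=PfeifferConfig(leave_out=list(range(12, 24))))
with_enc_only = num_params(model) - base - with_full  # adapter modules in the 12 encoder layers only

print(with_full, with_enc_only)  # the second count should be roughly half of the first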