
Example code for the Inter-Adapter attention plots in Adapter Fusion

daandouwe opened this issue on Nov 17, 2020 · 2 comments

🚀 Feature request

Example code to produce the (supercool!) Adapter Fusion inter-Adapter attention plots in figure 5 from the paper AdapterFusion: Non-Destructive Task Composition for Transfer Learning.

[Screenshot: Figure 5 from the AdapterFusion paper, showing the inter-adapter attention plots]

Motivation

Inspecting the attention scores in the AdapterFusion module for analysis is exciting, but I didn't find it easy to create the plots. The challenge was accessing the relevant tensors and then getting them into the right format. Hence my request for some help ;).

Creating square attention plots from the attention tensor saved in BertFusion.recent_attention (https://github.com/Adapter-Hub/adapter-transformers/blob/master/src/transformers/adapter_modeling.py#L218). As far as I understand, this tensor has shape [batch_size, seq_len, num_adapters], and when I average over the first two dimensions (mean(0).mean(0), which I will do for all batches in the prediction data) I get a tensor of num_adapters floats that sums to 1.

Should I understand this to be the attention displayed in the above figure? But how do I get something of shape [num_adapters, num_adapters]?
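In code, the averaging I describe looks roughly like this (a minimal sketch with a dummy tensor standing in for recent_attention; the shapes are my own understanding, not something I have confirmed against the library):

```python
import torch

# Dummy stand-in for BertFusion.recent_attention, which (as I understand it)
# has shape [batch_size, seq_len, num_adapters] and is normalised over the
# adapter dimension.
batch_size, seq_len, num_adapters = 8, 128, 3
recent_attention = torch.softmax(
    torch.randn(batch_size, seq_len, num_adapters), dim=-1
)

# Average over batch and sequence positions -> one weight per adapter.
per_adapter = recent_attention.mean(dim=(0, 1))

print(per_adapter)        # num_adapters floats
print(per_adapter.sum())  # sums to (approximately) 1
```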

Accessing the stored attention tensors from the Bert encoder during the prediction forward passes. I have been trying to trace the BertFusion module through transformers.adapter_bert to understand where this module ends up in the Bert model, and thus how I can access it from the top down. My guess from https://github.com/Adapter-Hub/adapter-transformers/blob/master/src/transformers/adapter_bert.py#L80 would be that

model.encoder.adapter_fusion_layer[adapter_fusion_name]

should give me the BertFusion module, which in turn would allow access to recent_attention after a prediction forward pass. But that does not seem to work. (If I understand correctly, that is because model.encoder has no attribute adapter_fusion_layer.)

How should I do this?

Your contribution

All I have to contribute are the incomplete findings I shared above. But my guess is that the authors of AdapterFusion have some snippets lying around. I could turn those into an example snippet, in a notebook or something. Whatever you prefer!

daandouwe avatar Nov 17 '20 10:11 daandouwe

I agree, this is quite a useful feature of AdapterFusion!

For instance, we leverage the recent_attention in our AdapterDrop paper for pruning AdapterFusion (§4.2). Hence, in #84, we will clean this up and add documentation on how to read out the fusion weights.

To answer your questions:

Should I understand this to be the attention displayed in the above figure? But how do I get something of shape [num_adapters, num_adapters]?

That is correct. If you consider only one downstream task, you could obtain a tensor of shape [n_layers, n_adapters, seq_len]. Averaging over the last dimension gives you [n_layers, n_adapters]. In the AdapterFusion paper, I believe we also averaged over n_layers, resulting in [n_adapters]. If you now repeat this for several tasks, you get [n_tasks, n_adapters], where n_tasks == n_adapters in the special case you are referring to.
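Roughly, that aggregation could look like this (dummy tensors only, standing in for the attention scores you collect per task; nothing here is tied to a specific API):

```python
import torch

# Shapes follow the description above.
n_layers, n_adapters, seq_len = 12, 8, 128
n_tasks = n_adapters  # the special square case shown in Figure 5

rows = []
for task in range(n_tasks):
    # Stand-in for the attention collected while evaluating one target task:
    # [n_layers, n_adapters, seq_len], normalised over the adapter dimension.
    attn = torch.softmax(torch.randn(n_layers, n_adapters, seq_len), dim=1)

    per_layer = attn.mean(dim=-1)     # [n_layers, n_adapters]
    per_task = per_layer.mean(dim=0)  # [n_adapters], averaged over layers
    rows.append(per_task)

# Stacking one row per task gives the square matrix from the plot.
matrix = torch.stack(rows)            # [n_tasks, n_adapters]
print(matrix.shape)
```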

Accessing the stored attention tensors [...] How should I do this?

One way could be:

model.roberta.encoder.layer[layer_i].output.adapter_fusion_layer['<name of the fusion layer>'].recent_attention
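For example, after a forward pass it could be read out along these lines (a rough sketch only; model, batch, and the fusion name are placeholders, and the exact attribute path may differ between model types and versions):

```python
import torch

# Assumptions: `model` is a RoBERTa-based model with an active AdapterFusion
# layer, `batch` is a dict of input tensors, and `fusion_name` is the key the
# fusion layer was registered under -- all placeholders here.
fusion_name = "<name of the fusion layer>"

with torch.no_grad():
    model(**batch)  # the forward pass populates recent_attention

per_layer = []
for layer in model.roberta.encoder.layer:
    fusion = layer.output.adapter_fusion_layer[fusion_name]
    # recent_attention: [batch_size, seq_len, n_adapters]
    attn = torch.as_tensor(fusion.recent_attention)
    per_layer.append(attn.mean(dim=(0, 1)))

attn_per_layer = torch.stack(per_layer)  # [n_layers, n_adapters]
print(attn_per_layer.mean(dim=0))        # additionally averaged over layers
```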

We will add a cleaned-up variant of this soon.

Would that address your issue?

arueckle avatar Nov 17 '20 15:11 arueckle

Thanks for the clarification! This addressed all my questions.

I will give this a try!

daandouwe avatar Nov 17 '20 17:11 daandouwe