Ludwig Sickert
Just want to add my two cents to this. After struggling to find the reason for this error within my own code, I finally noticed that one of my other...
@gsarti I was reading through the Jain and Wallace paper earlier and looking at their code, but I am not entirely sure what you mean by Last-Layer and Aggregated attention...
@gsarti Thanks a lot, that explains it. I was not sure if I was missing anything from the Jain and Wallace paper since they were introducing their methods for Adversarial...
@gsarti I think this concerns all attention methods, so I wanted to get your opinion before implementing it further: To run the attention-based methods, we need the `output_attentions=True`...
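For reference, a minimal sketch (using GPT-2 as a stand-in model, not the project's actual setup) of what enabling `output_attentions=True` on a forward pass looks like and what it returns:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 is just a placeholder here; any transformers model behaves the same way.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Hello world", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# outputs.attentions is a tuple with one tensor per layer,
# each of shape (batch, num_heads, seq_len, seq_len).
print(len(outputs.attentions), outputs.attentions[0].shape)
```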
Ok, then I will implement it like that for now.
@gsarti How would you want to deal with the information from multiple attention heads? I have seen several approaches in the different papers, either using the...
Ok, thanks for the explanation. One follow-up question: How would you specify "max" in this context? Taking the head with the overall maximal attention values or using the max values...
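To make the ambiguity concrete, here is a small sketch (shapes assumed, not taken from the actual code) contrasting the two possible readings of "max" over the head dimension:

```python
import torch

# attn: one layer's attention, shape (batch, num_heads, tgt_len, src_len)
attn = torch.rand(1, 12, 5, 5)

# Default: average over heads
mean_agg = attn.mean(dim=1)                      # (batch, tgt_len, src_len)

# Reading 1: element-wise max across heads, per position pair
elementwise_max = attn.max(dim=1).values         # (batch, tgt_len, src_len)

# Reading 2: select the single head containing the largest attention weight
head_idx = attn.amax(dim=(2, 3)).argmax(dim=1)   # (batch,)
single_head = attn[torch.arange(attn.size(0)), head_idx]
```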
@gsarti Sorry for all the questions, but there is another issue that came up: Since we are using the `generate()` method, most models I have tested have a defined number...
Hmm, I am not sure if I follow entirely. The main issue is that transformers is giving me all attention scores for all steps. If I understand it correctly now,...
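For concreteness, this is the structure `generate()` returns when asked for attentions (again a sketch with GPT-2 as a placeholder model):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("Hello", return_tensors="pt")

out = model.generate(
    **inputs,
    max_new_tokens=3,
    return_dict_in_generate=True,
    output_attentions=True,
)

# out.attentions: one entry per generation step; each entry is a tuple
# with one tensor per layer. With the default KV cache, the query length
# is 1 for every step after the first, so shapes differ across steps.
for step, layers in enumerate(out.attentions):
    print(step, layers[0].shape)
```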
Further possible additions to Basic Attention methods:
- [x] rename LastLayerAttention to single-layer attention and make the layer configurable (last layer by default) (see the sketch below)
- [x] allow users to choose a...
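A minimal sketch of how the configurable single-layer variant could look; `single_layer_attention` and its parameters are hypothetical names, not the implemented API:

```python
def single_layer_attention(attentions, layer=-1, head_agg="mean"):
    """Hypothetical helper: select one layer's attention (last layer by
    default) and aggregate over heads. `attentions` is the per-layer tuple
    returned by a transformers forward pass with output_attentions=True."""
    attn = attentions[layer]         # (batch, num_heads, tgt_len, src_len)
    if head_agg == "mean":
        return attn.mean(dim=1)
    if head_agg == "max":
        return attn.max(dim=1).values
    raise ValueError(f"Unknown head aggregation: {head_agg}")
```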