multifit
Self attention for pooling linear classifier
Add a BiAttentionPoolingClassifier
(self-attention for the pooling linear classifier) as in Attention Is All You Need, following the discussion with @sebastianruder in Teams.
I ran out of memory on my 1060 while testing the attention module, but was able to at least verify that it is functionally correct. Some changes might be required to ensure that the tensor passed to self.layers is of the right shape (I'm not quite sure yet).
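For reference, here is a minimal shape walk-through of the kind of head being discussed, assuming fastai-style conventions (a final RNN output of shape (bs, seq_len, nh) and a concat-pooling head whose linear block is called self.layers). The names and sizes are illustrative, not the actual PR code.

```python
import torch
import torch.nn as nn

bs, seq_len, nh = 8, 70, 400                  # illustrative sizes, not the PR's
outputs = torch.randn(bs, seq_len, nh)        # final RNN layer outputs

# Self-attention with query = key = value keeps the (bs, seq_len, nh) shape.
# batch_first=True needs a reasonably recent PyTorch; older versions expect
# inputs of shape (seq_len, bs, nh) instead.
attn = nn.MultiheadAttention(embed_dim=nh, num_heads=4, batch_first=True)
attended, _ = attn(outputs, outputs, outputs)
assert attended.shape == (bs, seq_len, nh)

# Concat pooling (last step + max pool + mean pool) as in the fastai
# PoolingLinearClassifier; this is the shape self.layers would then receive.
pooled = torch.cat([attended[:, -1],
                    attended.max(dim=1).values,
                    attended.mean(dim=1)], dim=1)
assert pooled.shape == (bs, 3 * nh)
```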
I'll move everything to Colab for testing and see if that helps.
The OOM issue persists even on Colab with 11GB of GPU memory.
RuntimeError: CUDA out of memory. Tried to allocate 8.41 GiB (GPU 0; 11.17 GiB total capacity; 10.26 GiB already allocated; 518.56 MiB free; 80.50 MiB cached)
It appears that I have run into a memory leak.
I am beginning to implement various attention options on top of ulmfit, so naturally I've looked at this code, but I do not really understand how attention is used here.
- I thought that attention would be used along the sequence length, over the different RNN outputs, roughly instead of the mean/max pooling and taking the last output.
- As mentioned, I thought about using attention instead of pooling, thereby reducing the dimensions flowing through the network. Here it is used with key = value = query, so if I understand correctly it preserves the dimensionality, computing a representation of each item in the context of all the other items? I guess I just don't understand; is there an intuitive explanation of what it does?
- I thought about using attention in the way described above. In that case, I think the query tensor should be learnable (or there should be multiple learnable tensors for multiple heads). Since this is a classification scenario, the mechanism I want is attention returning the RNN outputs most relevant to the classification task at hand (instead of taking a mean/max...); a rough sketch of this idea follows below. Does that make sense?
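To make the last point concrete, here is a rough sketch of what I mean by a learnable query, under assumed shapes. LearnableQueryPooling and its arguments are made-up names for illustration, not anything that exists in the repo.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableQueryPooling(nn.Module):
    """Pool a sequence of RNN outputs into one vector via learned queries.

    Hypothetical module: the query (or several, one per head) is a trainable
    parameter, and the output is an attention-weighted sum of the RNN outputs
    instead of mean/max pooling.
    """
    def __init__(self, nh, n_queries=1):
        super().__init__()
        self.query = nn.Parameter(torch.randn(n_queries, nh) / math.sqrt(nh))

    def forward(self, outputs):                  # outputs: (bs, seq_len, nh)
        # Scaled dot-product scores: (bs, n_queries, seq_len)
        scores = torch.einsum('qd,bld->bql', self.query, outputs)
        scores = scores / math.sqrt(outputs.size(-1))
        weights = F.softmax(scores, dim=-1)
        # Weighted sum of the RNN outputs: (bs, n_queries, nh)
        pooled = torch.einsum('bql,bld->bqd', weights, outputs)
        return pooled.flatten(1)                 # (bs, n_queries * nh)

pooled = LearnableQueryPooling(nh=400, n_queries=2)(torch.randn(8, 70, 400))
print(pooled.shape)                              # torch.Size([8, 800])
```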
@tpietruszka
One intuitive reason why I think this could be helpful is that the way we had planned on using XNLI was by concatenating the premise and the hypothesis -- so it is possible that we learn some premise-to-hypothesis attention through it. What do you think?
Of course, the dimensionality is preserved but I don't think that's a big problem.
I agree that a more "meaningful" way of applying attention is to attend over the hidden-layer outputs from the forward and backward LMs. In fact, applying attention to the concatenation of the pooling outputs was somewhat foolish of me.
What I'll do instead is attend only over a concatenation of the forward and backward LM outputs and also reduce the number of attention heads (which should solve the memory problem); a rough sketch of the idea follows below. I'll work on it this weekend and update.
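Roughly what I have in mind (assumed shapes and names, not the final code): concatenate the per-timestep forward and backward LM outputs along the feature dimension and run multi-head self-attention with a small number of heads over that, rather than attending over the pooled vector.

```python
import torch
import torch.nn as nn

bs, seq_len, nh = 8, 70, 400
fwd_out = torch.randn(bs, seq_len, nh)        # forward LM encoder outputs
bwd_out = torch.randn(bs, seq_len, nh)        # backward LM encoder outputs

# Concatenate along the feature dimension: (bs, seq_len, 2 * nh)
x = torch.cat([fwd_out, bwd_out], dim=-1)

# Fewer heads to keep the memory footprint down; output shape is unchanged.
attn = nn.MultiheadAttention(embed_dim=2 * nh, num_heads=2, batch_first=True)
attended, _ = attn(x, x, x)
assert attended.shape == (bs, seq_len, 2 * nh)
```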
Feel free to add to this PR if you have ideas for improving it. If you'd like to try a different experiment with attention, that's great too!
@dust0x I think all approaches are worth testing...
Recently I have been experimenting with different variants of attention applied to the LM outputs before pooling, on the IMDb task. I've pushed two variants to a small (and for now messy) repo, ulmfit_experiments; maybe it can be of some help.
Some observations:
- whatever I do, I seem to end up with an accuracy between 94% and 95%, with both uni- and bi-directional models. It is quite frustrating.
- I think attention might be helpful where there are fewer labeled examples, but that needs further testing.
- one possible interpretation of the fact that changing the classifier head's architecture does not change the results much: the 'bottleneck' is the language model, not the classifier head. But then again, adding bidirectionality should help, and it does not.
- in early versions I also had a GPU memory leak. It seems to be solved now, though I'm not sure how. I think it was related to some parameters not being correctly registered as part of the module (and, I guess, not de-allocated when appropriate); see the small illustration below.
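For what it's worth, here is a generic illustration of the registration pitfall I mean (not the actual ulmfit_experiments code): tensors and plain Python lists of modules stored as attributes are invisible to nn.Module, unlike nn.Parameter and nn.ModuleList.

```python
import torch
import torch.nn as nn

class Leaky(nn.Module):
    def __init__(self, nh):
        super().__init__()
        # Plain tensor and plain list: NOT registered with the module, so the
        # optimizer never sees them and they don't move with .to()/.cuda().
        self.w = torch.randn(nh, nh, requires_grad=True)
        self.heads = [nn.Linear(nh, nh) for _ in range(4)]

class Registered(nn.Module):
    def __init__(self, nh):
        super().__init__()
        # Properly registered: shows up in parameters(), state_dict(), etc.
        self.w = nn.Parameter(torch.randn(nh, nh))
        self.heads = nn.ModuleList(nn.Linear(nh, nh) for _ in range(4))

print(len(list(Leaky(8).parameters())))        # 0
print(len(list(Registered(8).parameters())))   # 9 (1 + 4 * (weight, bias))
```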
Please let me know if you have any thoughts on the subject.
The memory "leak" was my own fault. I changed the way I was using attention and that fixed it.
The self-attention module seems to be working okay in the tests I ran locally; I'll start the benchmarking now. @sebastianruder @PiotrCzapla are there any specific datasets that you would like to see the results on?
@tpietruszka it's very odd that you should get the same accuracy on IMDb across all your experiments. Is it possible that somewhere the classification head is hard-coded to BiPoolingLinearClassifier and it's defaulting to that every time? I'm only suggesting this because something similar came up when I was experimenting too, and of course it's possible that you have already checked.