Publish an Example Showing Intermediate Output from Pretrained Models
The last layer's hidden state is not always the best representation of text. In the literature, output from intermediate layers is leveraged as well to improve predictive performance.
Here is a notebook to get started: https://colab.research.google.com/drive/1mdodbRk6ayA6g0pCTWxoaQB3cJ8ecbjj?usp=sharing.
Thanks!
This is awesome to validate we do indeed have a working solution here, without needing to return a complex object from our modeling APIs with every intermediate output.
This is also more efficient: we don't have to hold references to this intermediate state when it's not being used. This is a nice plug for the functional style of Keras models.
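A minimal sketch of the idea, using plain `Dense` layers as a stand-in for a real pretrained backbone: the functional API lets you build a second model over the same graph that exposes an intermediate activation, with no changes to the original model's outputs.

```python
import numpy as np
import keras
from keras import layers

# A small functional model; the layer names are illustrative.
inputs = keras.Input(shape=(16,))
x = layers.Dense(8, activation="relu", name="hidden_1")(inputs)
x = layers.Dense(8, activation="relu", name="hidden_2")(x)
outputs = layers.Dense(2, name="logits")(x)
model = keras.Model(inputs, outputs)

# A probe model re-using the same layers, ending at an intermediate output.
# No references to intermediate state are held unless you build this.
probe = keras.Model(inputs, model.get_layer("hidden_1").output)

features = probe.predict(np.random.rand(4, 16))
print(features.shape)  # (4, 8)
```

The probe shares weights with the original model, so training one updates the other.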
Once we start putting up guides on pretrained models, it's definitely worth making an example showing this.
Some thoughts:
- Can you give some examples where the last layer is not used as a summary? I have not seen something like this since ELMo. It's just a lot of information if you think about it: `shape=(num_layers, max_sequence_length, hidden_dim)`.
- Wouldn't it be pretty heavyweight to always add this to the output? I could imagine implementing a special `call` which produced this output on request, or adding an argument to `call`.
- Would there be a big efficiency gain to indexing these tensors in advance rather than pulling them from `model.layers` at call? TFM indexes in advance, but also doesn't offer a functional model where the alternative is possible. `model.get_layer(name)` is a linear scan, so not super fast; `model.get_layer(index)` is fast, but it might be difficult to compute the indices for an arbitrary model.
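On the indexing question, one possible sketch (not an existing API): cache layer references in a dict once at construction time, so later lookups are O(1) hash hits instead of the linear scan that `model.get_layer(name)` performs on each call.

```python
import keras
from keras import layers

# A toy stack; the block names are illustrative.
inputs = keras.Input(shape=(4,))
x = inputs
for i in range(12):
    x = layers.Dense(4, name=f"block_{i}")(x)
model = keras.Model(inputs, x)

# Index layers once up front. Each lookup afterwards is a dict access
# rather than a scan over model.layers.
layer_index = {layer.name: layer for layer in model.layers}

# Same object either way; only the lookup cost differs.
assert layer_index["block_3"] is model.get_layer("block_3")
```

Whether this matters in practice depends on how often lookups happen; if they only run while tracing the functional graph, the scan cost is paid once anyway.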
Can answer the first one right away. The last layer isn't always the best representation of text. The first few layers learn linear word order better, and middle layers learn semantic features better (I hope I am not saying this the wrong way around). This is explained in Section 4.3 of the BERTology paper: https://arxiv.org/abs/2002.12327. As for specific examples of not just using the last layer, let me dig up a paper or two.
Oh, and also, if someone wants to probe BERT, representations from layers other than the last might be useful.
@jbischof I think you might be misunderstanding this issue? At least going off of your question.
> Wouldn't it be pretty heavyweight to always add this to the output? I could imagine implementing a special `call` which produced this output on request or adding an argument to `call`.
There is no proposal to add anything to the output here. Or any API changes.
Rather, this is just an issue to track an example that shows how you could use the functional API to slice and dice a model. You can access intermediate output today, so this is purely a documentation issue (let's sing the praises of the functional API). And documentation on keras.io most likely, not in the library itself.
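To make the "slice and dice" point concrete, here is a hedged sketch of what such a documentation example might show, with `Dense` layers standing in for transformer encoder layers (names like `encoder_0` are made up for illustration): a probe model that exposes every layer's hidden state at once, useful for the probing work linked below.

```python
import numpy as np
import keras
from keras import layers

# A toy "encoder stack"; each Dense stands in for a transformer layer.
inputs = keras.Input(shape=(10, 32))
x = inputs
for i in range(4):
    x = layers.Dense(32, name=f"encoder_{i}")(x)
model = keras.Model(inputs, x)

# Functional slicing of the existing model: one probe exposing the
# hidden state after every layer, with no changes to model's own API.
probe = keras.Model(
    inputs, [model.get_layer(f"encoder_{i}").output for i in range(4)]
)

states = probe.predict(np.zeros((2, 10, 32)))
print(len(states), states[0].shape)  # 4 states, each (2, 10, 32)
```

The same pattern applies to a real pretrained backbone as long as it is a functional model, which is exactly the plug for the functional style above.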
Probing: https://nlp.stanford.edu/~johnhew/interpreting-probes.html
Papers: https://arxiv.org/abs/2210.11466 https://arxiv.org/abs/2201.03529 https://openreview.net/forum?id=N5lxfjtUPOS