
Publish an Example Showing Intermediate Output from Pretrained Models

Open abheesht17 opened this issue 3 years ago • 1 comment

The last layer's hidden state is not always the best representation of text. In the literature, outputs from intermediate layers are also leveraged to improve predictive performance.

Here is a notebook to get started: https://colab.research.google.com/drive/1mdodbRk6ayA6g0pCTWxoaQB3cJ8ecbjj?usp=sharing.
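As a starting point, here is a minimal sketch of pulling an intermediate representation out of a pretrained backbone with the functional API. The preset and layer names below are assumptions; inspect backbone.layers for the real names in whichever model you use.

```python
import keras_nlp
from tensorflow import keras

# Preset name is an assumption; any BERT-style preset works the same way.
backbone = keras_nlp.models.BertBackbone.from_preset("bert_base_en_uncased")

# The backbone is a functional model, so every intermediate activation is
# already a symbolic tensor we can reference by layer name.
intermediate_output = backbone.get_layer("transformer_layer_3").output

# Build a feature extractor that exposes that intermediate representation,
# reusing the backbone's own inputs.
feature_extractor = keras.Model(
    inputs=backbone.inputs,
    outputs=intermediate_output,
)
```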

abheesht17 avatar Sep 15 '22 00:09 abheesht17

Thanks!

This is awesome for validating that we do indeed have a working solution here, without needing to return a complex object with every intermediate output from our modeling APIs.

This is also more efficient: we don't have to hold references to this intermediate state when it isn't being used. A nice plug for the functional style of Keras models.

Once we are putting up guides on pretrained models, definitely worth making an example showing this.

mattdangerw avatar Sep 15 '22 00:09 mattdangerw

Some thoughts:

  • Can you give some examples where the last layer is not used as a summary? I haven't seen something like this since ELMo. It's just a lot of information if you think about it: the full stack has shape (num_layers, max_sequence_length, hidden_dim).
  • Wouldn't it be pretty heavyweight to always add this to the output? I could imagine implementing a special call that produces this output on request, or adding an argument to call.
  • Would there be a big efficiency gain from indexing these tensors in advance rather than pulling them from model.layers at call time (see the sketch after this list)? TFM indexes in advance, but it also doesn't offer a functional model, so the alternative isn't possible there.
    • model.get_layer(name) does a linear scan, so it's not super fast.
    • model.get_layer(index) is fast, but it might be difficult to compute the indices for an arbitrary model.
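For reference, a quick sketch of the two lookups (the preset name, layer name, and index are assumptions; the index of a given block depends on how the backbone is assembled):

```python
import keras_nlp

backbone = keras_nlp.models.BertBackbone.from_preset("bert_base_en_uncased")

# Lookup by name walks backbone.layers until it finds a match (linear scan).
layer_by_name = backbone.get_layer("transformer_layer_3")

# Lookup by index is constant time, but the right index has to be known
# for this particular architecture.
layer_by_index = backbone.get_layer(index=7)

# Indexing in advance amortizes repeated lookups into a single pass.
transformer_layers = [
    layer for layer in backbone.layers if layer.name.startswith("transformer_layer")
]
```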

jbischof avatar Oct 24 '22 23:10 jbischof

I can answer the first one right away. The last layer isn't always the best representation of text: the first few layers learn linear word order better, while the middle layers learn semantic features better (I hope I am not saying this the wrong way around). This is explained in Section 4.3 of the BERTology paper: https://arxiv.org/abs/2002.12327. As for specific examples of not using just the last layer, let me dig up a paper or two.

abheesht17 avatar Oct 24 '22 23:10 abheesht17

Oh, and also: if someone wants to probe BERT, representations from layers other than the last might be useful.

abheesht17 avatar Oct 24 '22 23:10 abheesht17

@jbischof I think you might be misunderstanding this issue? At least going off of your question.

> Wouldn't it be pretty heavyweight to always add this to the output? I could imagine implementing a special call that produces this output on request, or adding an argument to call.

There is no proposal here to add anything to the output, or to make any API changes.

Rather, this is just an issue to track an example that shows how you can use the functional API to slice and dice a model. You can already access intermediate outputs today, so this is purely a documentation issue (let's sing the praises of the functional API). And most likely documentation on keras.io, not in the library itself.
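A rough sketch of the kind of slicing and dicing such an example could show, e.g. pooling a few late intermediate layers into a small classification head (the preset, layer names, and two-class head here are assumptions for illustration):

```python
import keras_nlp
from tensorflow import keras

backbone = keras_nlp.models.BertBackbone.from_preset("bert_base_en_uncased")

# Grab the symbolic outputs of a few intermediate transformer blocks.
layer_names = ["transformer_layer_9", "transformer_layer_10", "transformer_layer_11"]
intermediate_outputs = [backbone.get_layer(name).output for name in layer_names]

# Take the [CLS] position from each chosen layer and average across layers.
cls_embeddings = [output[:, 0, :] for output in intermediate_outputs]
pooled = keras.layers.Average()(cls_embeddings)

# Attach a small task head; no changes to the backbone or its call signature.
predictions = keras.layers.Dense(2, activation="softmax")(pooled)
classifier = keras.Model(backbone.inputs, predictions)
```

None of this needs new library APIs, which is the point the example on keras.io would make.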

mattdangerw avatar Oct 25 '22 19:10 mattdangerw

Probing: https://nlp.stanford.edu/~johnhew/interpreting-probes.html

Papers: https://arxiv.org/abs/2210.11466 https://arxiv.org/abs/2201.03529 https://openreview.net/forum?id=N5lxfjtUPOS

abheesht17 avatar Oct 27 '22 18:10 abheesht17