elk
Keeping language models honest by directly eliciting knowledge encoded in their activations.
This will enable ensembling, among other things.
Given a set of reporters for each layer of a model and a fixed input, we can extract the model's "belief" at each layer and see how it evolves over...
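A minimal sketch of this idea, assuming hypothetical per-layer reporters exposed as callables over precomputed hidden states (these names are illustrative, not elk's actual API):

```python
import torch

def belief_trajectory(hidden_states, reporters):
    """Apply one (hypothetical) reporter per layer to a fixed input's hidden
    states and return the predicted credence ("belief") at each layer.

    hidden_states: list of tensors, one per layer, each of shape (hidden_dim,)
    reporters:     list of callables mapping a hidden-state tensor to a scalar logit tensor
    """
    credences = []
    for h, reporter in zip(hidden_states, reporters):
        logit = reporter(h)                      # reporter's raw score for this layer
        credences.append(torch.sigmoid(logit))   # convert to a probability
    return torch.stack(credences)                # shape: (num_layers,)
```

A layer ensemble in the spirit of #60 could then, for example, simply average these per-layer credences.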
This is probably useful for layer ensembling (#60), and I'd also like to know how well the loss correlates with accuracy in general.
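As a rough illustration of the loss/accuracy question (not elk's actual evaluation code), one could correlate per-layer reporter losses with per-layer accuracies:

```python
import numpy as np

def loss_accuracy_correlation(losses, accuracies):
    """Pearson correlation between per-layer reporter losses and accuracies.

    `losses` and `accuracies` are placeholder names for one value per layer,
    produced by whatever evaluation pipeline is in use; a strongly negative
    correlation would suggest the loss tracks accuracy well.
    """
    return np.corrcoef(np.asarray(losses), np.asarray(accuracies))[0, 1]
```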
I noticed that the datasets supported in the code are all multiple-choice and classification types, such as IMDB, QNLI, and BoolQ. Can the code in this repository support free-form types...