awesome-llm-interpretability
Suggestion: add PAIR work; and how to add new suggestions?
The contributing doc link in your README is broken :) so I'm making a suggestion here instead of as a pull request. You might be interested in the PAIR team's work (pair.withgoogle.com and https://github.com/pair-code). In particular, we do a bunch of work on interpretability, including:
Interactive Explorable visualizations (https://pair.withgoogle.com/explorables/) explaining important and interesting ML phenomena; of particular relevance to LLMs are:
- Do Machine Learning Models Memorize or Generalize? (won VISxAI best submission in 2023, entering it into the VISxAI hall of fame)
- What Have Language Models Learned? (won VISxAI best submission in 2021, entering it into the VISxAI hall of fame)
Code/Tools: LIT (the Learning Interpretability Tool), https://pair-code.github.io/lit/, a popular tool, especially at Google, for applying interpretability techniques to ML models (most often used for language models, but it works with many kinds of models and data).
Some recent papers on interpretability of language models by PAIR:
- "Interpretability Illusions in the Generalization of Simplified Models" - Dan Friedman, Andrew Lampinen, Lucas Dixon, Danqi Chen, Asma Ghandeharioun [arxiv]
- (EMNLP 2023) "Self-Influence Guided Data Reweighting for Language Model Pre-training" - M Thakkar, T Bolukbasi, S Ganapathy, S Vashishth, S Chandar, P Talukdar [arxiv]
- (EMNLP 2023) "Data Similarity is Not Enough to Explain Language Model Performance" - Greg Yauney, Emily Reif, David Mimno [acl]
- (NeurIPS 2023) "Post Hoc Explanations of Language Models Can Improve Language Models" - Satyapriya Krishna, Jiaqi Ma, Dylan Slack, Asma Ghandeharioun, Sameer Singh, Himabindu Lakkaraju [arxiv]
- (NeurIPS 2023 Spotlight) "Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models" - Peter Hase, Mohit Bansal, Been Kim, Asma Ghandeharioun [arXiv, Tweet Summary]
And there's a lot more here: https://pair.withgoogle.com/research/