mechanistic-interpretability topic
automated-explanations
Explain a black-box module in natural language.
codebook-features
Sparse and discrete interpretability tool for neural networks
pyvene
Stanford NLP Python Library for Understanding and Improving PyTorch Models via Interventions
interpretability-starter
🧠 Starter templates for doing interpretability research
sparse-probing-paper
Full code for the sparse probing paper.
universal-neurons
Universal Neurons in GPT2 Language Models
deepdistilling
Mechanistically interpretable neurosymbolic AI (Nature Comput Sci 2024): losslessly compressing NNs to computer code and discovering new algorithms which generalize out-of-distribution and outperform...
steering-vectors
Steering vectors for transformer language models in PyTorch / Hugging Face
causalgym
CausalGym: Benchmarking causal interpretability methods on linguistic tasks
llm-latent-language
Repo accompanying our paper "Do Llamas Work in English? On the Latent Language of Multilingual Transformers".