mechanistic-interpretability topic

codebook-features

Stars: 37 | Forks: 1

Sparse and discrete interpretability tool for neural networks

pyvene

Stars: 509 | Forks: 43

Stanford NLP Python Library for Understanding and Improving PyTorch Models via Interventions

interpretability-starter

Stars: 51 | Forks: 1

🧠 Starter templates for doing interpretability research

universal-neurons

Stars: 20 | Forks: 4

Universal Neurons in GPT2 Language Models

deepdistilling

Stars: 71 | Forks: 6

Mechanistically interpretable neurosymbolic AI (Nature Comput Sci 2024): losslessly compressing NNs to computer code and discovering new algorithms which generalize out-of-distribution and outperform...

steering-vectors

Stars: 25 | Forks: 2

Steering vectors for transformer language models in PyTorch / Hugging Face

causalgym

Stars: 23 | Forks: 2

CausalGym: Benchmarking causal interpretability methods on linguistic tasks

llm-latent-language

Stars: 26 | Forks: 6

Repo accompanying our paper "Do Llamas Work in English? On the Latent Language of Multilingual Transformers".