mechanistic-interpretability topic

automated-explanations

36 stars, 6 forks

Generating and validating natural-language explanations.

codebook-features

51 stars, 2 forks

Sparse and discrete interpretability tool for neural networks

pyvene

627 stars, 61 forks

Stanford NLP Python Library for Understanding and Improving PyTorch Models via Interventions

interpretability-starter

61 stars, 1 fork

🧠 Starter templates for doing interpretability research

universal-neurons

25 stars, 5 forks

Universal Neurons in GPT2 Language Models

deepdistilling

76 stars, 7 forks

Mechanistically interpretable neurosymbolic AI (Nature Comput Sci 2024): losslessly compressing NNs to computer code and discovering new algorithms which generalize out-of-distribution and outperform...

steering-vectors

61 stars, 5 forks

Steering vectors for transformer language models in PyTorch / Hugging Face

causalgym

40 stars, 5 forks

CausalGym: Benchmarking causal interpretability methods on linguistic tasks

llm-latent-language

52 stars, 12 forks

Repo accompanying our paper "Do Llamas Work in English? On the Latent Language of Multilingual Transformers".