mechanistic-interpretability topic

automated-explanations

Stars: 29, Forks: 5

Explain a black-box module in natural language.

artificial-intelligence

automated-interpretability

codebook-features

Stars: 37, Forks: 1

Sparse and discrete interpretability tool for neural networks

interpretability

pyvene

Stars: 509, Forks: 43

Stanford NLP Python Library for Understanding and Improving PyTorch Models via Interventions

activation-intervention

activation-patching

interpretability

interpretability-starter

Stars: 51, Forks: 1

🧠 Starter templates for doing interpretability research

interpretability

interpretability-jam

mechanistic-interpretability

sparse-probing-paper

Stars: 38, Forks: 10

Full code for the sparse probing paper.

interpretability

mechanistic-interpretability

universal-neurons

Stars: 20, Forks: 4

Universal Neurons in GPT2 Language Models

interpretability

mechanistic-interpretability

deepdistilling

Stars: 71, Forks: 6

Mechanistically interpretable neurosymbolic AI (Nature Comput Sci 2024): losslessly compressing NNs to computer code and discovering new algorithms which generalize out-of-distribution and outperform...

deep-distilling

machine-learning

steering-vectors

Stars: 25, Forks: 2

Steering vectors for transformer language models in PyTorch / Hugging Face

steering-vectors

mechanistic-interpretability

causalgym

Stars: 23, Forks: 2

CausalGym: Benchmarking causal interpretability methods on linguistic tasks

interpretability

mechanistic-interpretability

llm-latent-language

Stars: 26, Forks: 6

Repo accompanying our paper "Do Llamas Work in English? On the Latent Language of Multilingual Transformers".

mechanistic-interpretability

multilingual-nlp