mechanistic-interpretability topic
mechanistic-interpretability repositories
DecisionTransformerInterpretability
65
Stars
16
Forks
Interpreting how transformers simulate agents performing RL tasks
finetuning
17
Stars
2
Forks
This repository contains the code used for the experiments in the paper "Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking".
Awesome-Interpretability-in-Large-Language-Models
256
Stars
16
Forks
This repository collects all relevant resources about interpretability in LLMs
arrakis
17
Stars
1
Forks
Arrakis is a library to conduct, track and visualize mechanistic interpretability experiments.
modelcomponents
112
Stars
5
Forks
Decomposing and Editing Predictions by Modeling Model Computation
Language-Model-SAEs
32
Stars
6
Forks
A repository for the OpenMOSS Mechanistic Interpretability Team's Sparse Autoencoder (SAE) research.