mechanistic-interpretability topic

DecisionTransformerInterpretability

Stars: 65, Forks: 16

Interpreting how transformers simulate agents performing RL tasks

finetuning

Stars: 17, Forks: 2

This repository contains the code used for the experiments in the paper "Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking".

arrakis

Stars: 17, Forks: 1

Arrakis is a library to conduct, track and visualize mechanistic interpretability experiments.

modelcomponents

Stars: 112, Forks: 5

Decomposing and Editing Predictions by Modeling Model Computation

Language-Model-SAEs

Stars: 32, Forks: 6

For the OpenMOSS Mechanistic Interpretability Team's Sparse Autoencoder (SAE) research.