mechanistic-interpretability topics

DecisionTransformerInterpretability

65 Stars, 16 Forks

Interpreting how transformers simulate agents performing RL tasks

jbloomAus

mechanistic-interpretability

reinforcement-learning

finetuning

17 Stars, 2 Forks

This repository contains the code used for the experiments in the paper "Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking".

Nix07

entity-tracking

finetuning

mechanistic-interpretability

science-of-deep-learning

Awesome-Interpretability-in-Large-Language-Models

256 Stars, 16 Forks

This repository collects resources on interpretability in LLMs.

ruizheliUOA

dictionary-learning

interpretability-and-explainability

mechanistic-interpretability

sparse-autoencoder

arrakis

17 Stars, 1 Fork

Arrakis is a library to conduct, track and visualize mechanistic interpretability experiments.

yash-srivastava19

anthropic

explainable-ai

garcon

mechanistic-interpretability

modelcomponents

112 Stars, 5 Forks

Decomposing and Editing Predictions by Modeling Model Computation

MadryLab

attribution

interpretability

mechanistic-interpretability

model-editing

Language-Model-SAEs

32 Stars, 6 Forks

For OpenMOSS Mechanistic Interpretability Team's Sparse Autoencoder (SAE) research.

OpenMOSS

interpretability

mechanistic-interpretability

sparse-autoencoders

sparse-dictionary