Experiments done during SERI MATS (Summer 2023)

Relation to research writeups

`/refusal`

Activation steering with a "refusal vector" to make the llama-2-chat model stop refusing to answer harmful questions.

Red-teaming language models via activation engineering

`/sycophancy`

Activation steering to modulate sycophancy in the llama-2-chat and llama-2 base models.

Modulating sycophancy in an RLHF model via activation steering
Reducing sycophancy and improving honesty via activation steering
Understanding and visualizing sycophancy datasets
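As a rough illustration of where a steering vector can come from, the sketch below takes the difference between mean activations on two contrasting sets of examples. This is one common construction and is an assumption here; the code in `/sycophancy` may build its vector differently. The activations are hypothetical 3-dimensional lists, not real model outputs, and `mean_vector` and `steering_vector` are illustrative names.

```python
def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(positive_acts, negative_acts):
    """Difference of means: mean(positive_acts) - mean(negative_acts)."""
    pos = mean_vector(positive_acts)
    neg = mean_vector(negative_acts)
    return [p - q for p, q in zip(pos, neg)]


# Hypothetical activations at one layer (toy values, not from a real model).
sycophantic = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
non_sycophantic = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]

vec = steering_vector(sycophantic, non_sycophantic)
print(vec)  # [2.0, -2.0, 0.0]
```

Adding a positive multiple of such a vector to the residual stream would be expected to push the model toward the "positive" behavior, and a negative multiple away from it.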

`/steering`

Activation addition experiments (pure act-adds from single forward passes)

Activation adding experiments with llama-7b
Activation adding experiments with FLAN-T5
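A minimal sketch of activation addition, reading "pure act-adds from single forward passes" as: take the activation difference between two prompts, each run through the model once, and add a scaled copy of it to the hidden states of another prompt. All values are toy numbers and the function names are hypothetical; the notebooks in `/steering` operate on real model layers.

```python
def act_add_vector(acts_a, acts_b):
    """Steering vector from two single forward passes: acts_a - acts_b."""
    return [a - b for a, b in zip(acts_a, acts_b)]


def add_steering(hidden_states, vector, multiplier=1.0):
    """Return a copy of hidden_states (positions x dims) with
    multiplier * vector added at every token position."""
    return [
        [h + multiplier * v for h, v in zip(state, vector)]
        for state in hidden_states
    ]


# Hypothetical layer activations for two contrasting prompts.
acts_prompt_a = [2.0, -1.0]
acts_prompt_b = [1.0, 1.0]
vector = act_add_vector(acts_prompt_a, acts_prompt_b)  # [1.0, -2.0]

# Hypothetical hidden states for a two-token target prompt.
hidden = [[0.5, 1.0], [1.5, -1.0]]
steered = add_steering(hidden, vector, multiplier=2.0)
print(steered)  # [[2.5, -3.0], [3.5, -5.0]]
```

The multiplier controls the strength and sign of the intervention; a multiplier of zero leaves the hidden states unchanged.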

`/intermediate_decoding`

Logit-lens experiments (directly decoding intermediate activations by passing them through the unembedding layer)

Decoding intermediate activations in llama-2-7b
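The logit-lens idea can be sketched with a toy vocabulary: multiply an intermediate hidden state by the unembedding matrix, apply softmax, and read off the most likely tokens. The matrix, vocabulary, and hidden state below are made up for illustration. The sketch also skips the final normalization layer that llama models apply before unembedding, which a real implementation would likely include.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logit_lens(hidden, unembed, vocab, top_k=2):
    """Project a hidden state (d,) through unembed (vocab_size x d)
    and return the top_k (token, probability) pairs."""
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in unembed]
    probs = softmax(logits)
    ranked = sorted(zip(vocab, probs), key=lambda tp: tp[1], reverse=True)
    return ranked[:top_k]


# Hypothetical 3-token vocabulary and 2-dimensional residual stream.
vocab = ["cat", "dog", "car"]
unembed = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
hidden_at_layer = [2.0, 0.5]

top = logit_lens(hidden_at_layer, unembed, vocab)
print([token for token, _ in top])  # ['cat', 'dog']
```

Repeating this at each layer shows how the model's next-token guess evolves through the network.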

Other directories

`/data_generation`

Code for generating datasets with the GPT-4, GPT-3.5, and Claude APIs

`/probability_calibration`

Early-stage experiments measuring whether LLMs are aware of their own uncertainty about a prediction
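One standard way to quantify how well stated confidence tracks correctness is expected calibration error (ECE). The sketch below is an assumption about what such a measurement could look like, not necessarily the metric used in `/probability_calibration`; the confidences and labels are invented.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy,
    computed over equal-width confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / total) * abs(avg_conf - accuracy)
    return ece


# Hypothetical model confidences and whether each answer was right.
confidences = [0.9, 0.9, 0.6, 0.6]
correct = [True, True, True, False]

ece = expected_calibration_error(confidences, correct)
print(round(ece, 3))  # 0.1
```

An ECE near zero means the model's confidence matches its empirical accuracy; larger values indicate over- or under-confidence.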

`/unlearning`

Early-stage attempt at Google's Machine Unlearning Challenge
