LM-exp
LLM experiments done during SERI MATS, focusing on activation steering and interpreting activation spaces
Experiments done during SERI MATS (Summer 2023)
Relation to research writeups
/refusal
Activation steering with a "refusal vector" to make the llama-2-chat model stop refusing to answer harmful questions.
/sycophancy
Activation steering to modulate sycophancy in the llama-2-chat and llama-2 base models.
- Modulating sycophancy in an RLHF model via activation steering
- Reducing sycophancy and improving honesty via activation steering
/steering
Activation addition experiments (pure act-adds from single forward passes)
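
As a rough illustration of activation addition, the sketch below forms a steering vector as the difference between the activations of two contrasting prompts at one layer, then adds a scaled copy of it to another prompt's activation. The toy vectors, the `steering_vector`, `add_steering`, and `dot` helpers, and the multiplier values are illustrative assumptions, not code from this repository; in a real experiment the activations would be read from, and written back into, a forward pass of the model.

```python
# Toy sketch of activation addition. Plain lists stand in for
# residual-stream activations at a single layer and token position.


def steering_vector(act_positive, act_negative):
    """Difference of activations from two contrasting prompts."""
    return [p - n for p, n in zip(act_positive, act_negative)]


def add_steering(activation, vector, multiplier):
    """Add a scaled steering vector to an activation."""
    return [a + multiplier * v for a, v in zip(activation, vector)]


def dot(u, v):
    """Dot product, used here to measure movement along the direction."""
    return sum(x * y for x, y in zip(u, v))


# Hypothetical activations (assumed values, not measured from a model).
act_positive = [1.0, 0.5, -0.25, 2.0]
act_negative = [0.0, 0.5, 0.75, 1.0]
vector = steering_vector(act_positive, act_negative)

activation = [0.25, -1.0, 0.5, 0.0]
steered_up = add_steering(activation, vector, multiplier=2.0)
steered_down = add_steering(activation, vector, multiplier=-2.0)

# A positive multiplier moves the activation along the steering
# direction; a negative multiplier moves it the opposite way.
print(dot(steered_up, vector) > dot(activation, vector))    # True
print(dot(steered_down, vector) < dot(activation, vector))  # True
```

The sign and size of the multiplier are the knobs here: the same vector can push a behaviour up or down depending on the sign.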
/intermediate_decoding
Logit-lens experiments (directly decoding intermediate activations by passing them through the unembedding layer)
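
A minimal sketch of the logit-lens idea, assuming a toy three-token vocabulary and a hand-written unembedding matrix: each intermediate activation is projected onto the vocabulary and the most likely token is read off. The names `unembed`, `softmax`, and `top_token` and all numeric values are hypothetical and not taken from this repository. A real model would typically also apply its final normalization layer before unembedding, which is omitted here.

```python
import math

# Toy sketch of the logit lens: decode an intermediate activation by
# projecting it through the unembedding matrix and applying a softmax.


def unembed(activation, unembedding):
    """Project a hidden vector (d_model) to logits (one per vocab token)."""
    return [
        sum(a * w for a, w in zip(activation, token_weights))
        for token_weights in unembedding
    ]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_token(activation, unembedding, vocab):
    """Return the most likely token and its probability."""
    probs = softmax(unembed(activation, unembedding))
    best = max(range(len(probs)), key=probs.__getitem__)
    return vocab[best], probs[best]


# Hypothetical vocabulary and unembedding weights, one row per token.
vocab = ["cat", "dog", "fish"]
unembedding = [
    [1.0, 0.0],    # cat
    [0.0, 1.0],    # dog
    [-1.0, -1.0],  # fish
]

# Assumed activations for the same token position at increasing depth.
layers = {
    "layer_1": [0.5, 0.25],
    "layer_2": [0.25, 1.0],
    "layer_3": [-0.5, 3.0],
}

for name, activation in layers.items():
    token, prob = top_token(activation, unembedding, vocab)
    print(name, token, round(prob, 3))
```

Reading the decoded token layer by layer shows how the toy "prediction" shifts from one token to another as depth increases, which is the kind of trajectory logit-lens plots visualize.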
Other directories
/data_generation
- Code for generating datasets with LLMs via the GPT-4, GPT-3.5, and Claude APIs
/probability_calibration
- Early-stage experiments to measure whether LLMs are aware of their internal uncertainty over a prediction
/unlearning
- Early-stage attempt at Google's Machine Unlearning Challenge