Easy-Transformer
Add a demo of direct path patching
Direct path patching is like activation patching, but rather than patching in the output of a component A everywhere, it acts on pairs of components A and B (where B is in a later layer than A). We patch the output of A only into the input of B; all other components still see the old output of A. I want to add a section to the Exploratory Analysis Demo demonstrating this for all pairs of heads.
E.g. to do direct path patching on the query of head B, we'd add a hook that sets patched_B_query = original_B_query + (clean_A_output - corrupted_A_output) @ W_Q / layer_norm_scale
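To make the formula above concrete, here is a minimal dependency-free sketch on toy vectors. It is not the library's hook API: the helper names (`matvec`, `direct_path_patch_query`) and the toy values are hypothetical. It also assumes that `original_B_query` comes from the corrupted run, that the layer norm scale is frozen at its value from that run, and that layer norm centering and the query bias can be ignored.

```python
def matvec(x, W):
    """Multiply a row vector x (length d_model) by a matrix W (d_model x d_head)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def direct_path_patch_query(original_B_query, clean_A_output,
                            corrupted_A_output, W_Q, layer_norm_scale):
    """Hypothetical helper: patch only the A -> B query path.

    Assumes the layer norm scale is frozen, so the map from the residual
    stream to B's query is linear and the change in A's output can be
    pushed through W_Q directly.
    """
    delta = [c - k for c, k in zip(clean_A_output, corrupted_A_output)]
    delta_query = matvec(delta, W_Q)
    return [q + dq / layer_norm_scale
            for q, dq in zip(original_B_query, delta_query)]


# Toy setup (illustrative values only): d_model = 3, d_head = 2.
W_Q = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
clean_A_output = [1.0, 2.0, 3.0]
corrupted_A_output = [0.0, 1.0, 1.0]
other_components = [0.5, -0.5, 1.0]  # everything else in B's input
layer_norm_scale = 2.0               # assumed frozen from the corrupted run

# B's query in the corrupted run, before any patching.
corrupted_resid = [o + a for o, a in zip(other_components, corrupted_A_output)]
original_B_query = matvec([r / layer_norm_scale for r in corrupted_resid], W_Q)

patched_B_query = direct_path_patch_query(
    original_B_query, clean_A_output, corrupted_A_output, W_Q, layer_norm_scale
)

# Sanity check: recompute B's query as if A's clean output had been in
# B's input all along (same frozen scale). The two should agree.
patched_resid = [o + a for o, a in zip(other_components, clean_A_output)]
recomputed_query = matvec([r / layer_norm_scale for r in patched_resid], W_Q)
```

In a real model the same arithmetic would run inside a hook on B's query activation, with every other component left to read the unpatched output of A.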
For reference, an old PR to add it to an early version of the library: #49
@callummcdougall has prepared a PR with path patching, which I expect will be added soon.