Easy-Transformer Add a demo of direct path patching

Add a demo of direct path patching

Open neelnanda-io opened this issue 2 years ago • 1 comments

Direct path patching is like activation patching, but rather than patching in the output of component A everywhere, it acts on pairs of components A and B (where B is in a layer after A). We patch the output of A only into the input of B, and all other components see the old output of A. I want to add a section to the Exploratory Analysis Demo demonstrating this for all pairs of heads.

E.g. to do direct path patching on the query of head B, we'd add a hook that sets patched_B_query = original_B_query + (clean_A_output - corrupted_A_output) @ W_Q / layer_norm_scale
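A minimal NumPy sketch of that query formula, as a sanity check rather than a proposed implementation. Everything here is a hypothetical stand-in: the tensors are random toy arrays, not cached activations from the library, and `frozen_scale` and `query` are illustrative helpers. It assumes the layer norm scale is frozen at its value from the original (corrupted) run and ignores layer norm centering. Under those assumptions, the formula should match rebuilding B's input with only A's output swapped.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pos, d_model, d_head = 3, 8, 4

# Hypothetical toy tensors standing in for cached activations and weights.
W_Q = rng.normal(size=(d_model, d_head))               # query weights of head B
b_Q = rng.normal(size=(d_head,))                       # query bias of head B
clean_A_output = rng.normal(size=(n_pos, d_model))     # A's output on the clean run
corrupted_A_output = rng.normal(size=(n_pos, d_model)) # A's output on the corrupted run
other_components = rng.normal(size=(n_pos, d_model))   # rest of the residual stream


def frozen_scale(resid):
    # Per-position normalisation scale (assumption: centering is ignored).
    return np.sqrt((resid ** 2).mean(axis=-1, keepdims=True) + 1e-5)


def query(resid, scale):
    # Head B's query computed from a normalised residual stream.
    return (resid / scale) @ W_Q + b_Q


# Original (corrupted) run: B reads a residual stream containing corrupted A.
corrupted_resid = other_components + corrupted_A_output
layer_norm_scale = frozen_scale(corrupted_resid)
original_B_query = query(corrupted_resid, layer_norm_scale)

# The formula from the issue: add A's clean-minus-corrupted difference,
# mapped through W_Q and divided by the frozen layer norm scale.
patched_B_query = (
    original_B_query
    + (clean_A_output - corrupted_A_output) @ W_Q / layer_norm_scale
)

# Reference: rebuild B's input with only A's output swapped, same frozen scale.
reference = query(other_components + clean_A_output, layer_norm_scale)
assert np.allclose(patched_B_query, reference)
```

The two agree only because the scale is held fixed. If the layer norm were recomputed on the patched residual stream, the formula would be an approximation.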

For reference, an old PR to add it to an early version of the library: #49

Dec 19 '22 14:12 neelnanda-io

@callummcdougall has prepared a PR with path patching, which I expect will be added soon.

May 22 '23 08:05 jbloomAus
