[Proposal] Save and Load subgraph as dict

Open chris-aeviator opened this issue 10 months ago • 5 comments

Proposal

Add a structured representation of the subgraph/circuit that can be saved and loaded, similar to ACDC.

Motivation

When using TransformerLens I can find subgraphs through attribution patching. However, there seems to be no easy way to use the result further downstream due to the lack of a structured representation. While I can (might?) piece together all components, it is not straightforward to get the result out for plotting, archiving, or re-loading at a later point.

Pitch

Having a save_to_dict would solve the above 'shortcomings' in working further downstream with results gained from TL. It would enable loading a patched subgraph in a further step.
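A minimal sketch of what such a round trip could look like, assuming a plain node/edge representation. Everything here is hypothetical: `Subgraph`, `load_from_dict`, the node names (`a0.h1`, `m2`, `logits`) and the scores are made up for illustration, and none of it is part of TransformerLens.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Subgraph:
    """Hypothetical container for a circuit; not a TransformerLens class."""

    model_name: str
    nodes: list = field(default_factory=list)  # node names, e.g. "a0.h1", "m2"
    edges: list = field(default_factory=list)  # [src, dst, score] triples

    def save_to_dict(self) -> dict:
        # asdict() returns plain dicts and lists, so the result is JSON-friendly.
        return asdict(self)

    @classmethod
    def load_from_dict(cls, data: dict) -> "Subgraph":
        return cls(
            model_name=data["model_name"],
            nodes=list(data["nodes"]),
            edges=[list(edge) for edge in data["edges"]],
        )


# Illustrative values only; these are not real attribution results.
graph = Subgraph(
    model_name="gpt2-small",
    nodes=["a0.h1", "m2", "logits"],
    edges=[["a0.h1", "m2", 0.42], ["m2", "logits", 0.17]],
)

# Round trip through JSON, e.g. for archiving or sending over an API.
payload = json.dumps(graph.save_to_dict())
restored = Subgraph.load_from_dict(json.loads(payload))
```

Keeping the dict limited to strings, lists and floats means it can be written to disk, plotted, or reloaded without importing the model itself.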

Alternatives

I have started finding and mapping values from Attribution_Patching_Demo.ipynb, but not everything is clear: compared to ACDC, this example has a finer granularity (k, v, q, ...).

Additional context

grafik

Checklist

  • [x] I have checked that there is no similar issue in the repo (required)

chris-aeviator avatar Apr 25 '24 10:04 chris-aeviator

Can you say more about what this would look like? I'm quite confused by the proposal. TransformerLens doesn't even have innate support for path patching, so I don't see what it would mean to support subgraphs. Could you eg give an example snippet of code for what you imagine using this feature to look like, and the interface?

neelnanda-io avatar Apr 25 '24 15:04 neelnanda-io

Sure - the confusion might be more on my side, but I'll try to explain:

When I look at the data in https://github.com/neelnanda-io/TransformerLens/blob/main/demos/Attribution_Patching_Demo.ipynb I have Head Inputs and Head outputs (see Head Path Attribution Patching), I also have MLP outputs. My thinking here is: with a known network architecture this should give me a subgraph/ circuit if I can construct the edges.
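To make the idea concrete, here is a rough sketch of deriving edges from per-pair scores and a known layer ordering. It assumes sequential blocks (attention, then MLP, in each layer) and a made-up naming scheme (`a<layer>.h<head>`, `m<layer>`); the helper names and scores are hypothetical and do not come from the demo notebook.

```python
from itertools import product


def node_order(name: str) -> int:
    """Position of a node in the forward pass: attention precedes the MLP in a layer."""
    kind = name[0]                           # "a" for attention head, "m" for MLP
    layer = int(name[1:].split(".")[0])      # "a0.h1" -> 0, "m2" -> 2
    return 2 * layer + (0 if kind == "a" else 1)


def candidate_edges(nodes):
    """All (src, dst) pairs where src can write to the residual stream before dst reads it."""
    return [(s, d) for s, d in product(nodes, nodes) if node_order(s) < node_order(d)]


def keep_edges(scores, threshold):
    """Keep architecturally valid edges whose absolute score reaches the threshold."""
    return sorted(
        (s, d, v)
        for (s, d), v in scores.items()
        if node_order(s) < node_order(d) and abs(v) >= threshold
    )


nodes = ["a0.h1", "m0", "a1.h0"]
pairs = candidate_edges(nodes)

# Made-up scores for illustration only.
scores = {
    ("a0.h1", "m0"): 0.5,
    ("a0.h1", "a1.h0"): -0.02,
    ("m0", "a1.h0"): -0.3,
}
circuit = keep_edges(scores, threshold=0.1)
```

Models with parallel attention and MLP blocks would need a different ordering rule than the one assumed in `node_order`.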

What I'm trying to achieve is constructing a graph/circuit that describes the patched model. An exemplary interface might be https://github.com/hannamw/EAP-IG/blob/main/eap/graph.py

I'm in the early days of building infrastructure to create and eval experiments via gRPC/API calls.

chris-aeviator avatar Apr 25 '24 15:04 chris-aeviator

CC @UFO-101 who is building a general automated interp library. In my opinion it's better to build a library on top of TL, rather than inside TL. What advantages would there be to adding this into TL? My vibe is that the majority of TL users are not doing automated interp (even if they should be >:) ...)

ArthurConmy avatar Apr 25 '24 15:04 ArthurConmy

+1 to Arthur, I think this would be a cool thing for someone to build on top

neelnanda-io avatar Apr 25 '24 15:04 neelnanda-io

@ArthurConmy thanks for pointing this out - yeah I don't wanna waste anyone's time, feel free to close this if irrelevant.

While using TL you are doing what I'm looking for in ACDC. Is there any conceptual difference (other than Act. patching vs. Attr. patching) that I don't see?

chris-aeviator avatar Apr 25 '24 15:04 chris-aeviator

FYI the library is now released: https://ufo-101.github.io/auto-circuit/ Post explaining how it works: https://www.lesswrong.com/posts/caZ3yR5GnzbZe2yJ3/how-to-do-patching-fast

UFO-101 avatar May 11 '24 20:05 UFO-101

@UFO-101 - awesome - thanks for sending a note :+1:

chris-aeviator avatar May 12 '24 10:05 chris-aeviator