(Draft) Add DLA function to utils

Open VasilGeorgiev39 opened this issue 1 year ago • 5 comments

Description

DLA is usually the first step we do in a new exploration. I think it would be nice to have a common function that does it in a single step.

Let me know if you think this does not generalize well enough or if you have other concerns.

Not sure if Utils is the right place for it tho, maybe we can create a new module that will hold the mech interp toolkit?

If it looks good I'll write tests and stuff.

Type of change

Please delete options that are not relevant.

  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [x] New feature (non-breaking change which adds functionality)
  • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • [ ] This change requires a documentation update

Checklist:

  • [x] I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] My changes generate no new warnings
  • [ ] I have added tests that prove my fix is effective or that my feature works
  • [ ] New and existing unit tests pass locally with my changes
  • [x] I have not rewritten tests relating to key interfaces which would affect backward compatibility

VasilGeorgiev39 avatar Dec 16 '23 04:12 VasilGeorgiev39

Thanks for starting on this. It seems useful, and I agree that it should be its own file (probably just for DLA, as it'll become quite large once it's fully documented etc).

In general I agree as well that it's probably worth expanding this a bit to work more generally. Specifically, you can break DLA down recursively, e.g. by attention layer -> attention head -> source layer -> source component... It would be nice to have this as well.
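To make the recursive idea concrete, here is a minimal sketch of the first two levels (attention layer -> attention head) on toy data. It is illustrative only: it uses plain Python lists rather than any TransformerLens call, the names `logit_diff_direction`, `attribute`, and `group_by_layer` are hypothetical, the numbers are made up, and it ignores the final LayerNorm scaling that a real DLA has to account for. The point it shows is that each component's output is projected onto the same residual-stream direction, so per-head attributions sum to per-layer attributions, which in turn sum to the total logit difference.

```python
from typing import Dict, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def logit_diff_direction(unembed: Dict[str, Vector], correct: str, incorrect: str) -> Vector:
    """Residual-stream direction for (correct logit - incorrect logit).

    `unembed` is a hypothetical token -> unembedding-vector lookup.
    """
    return [c - i for c, i in zip(unembed[correct], unembed[incorrect])]


def attribute(outputs: Dict[str, Vector], direction: Vector) -> Dict[str, float]:
    """Project each component's residual-stream output onto the direction."""
    return {name: dot(vec, direction) for name, vec in outputs.items()}


def group_by_layer(per_head: Dict[str, float]) -> Dict[str, float]:
    """Roll head-level attributions up to layer level (names look like 'L0H1')."""
    per_layer: Dict[str, float] = {}
    for name, value in per_head.items():
        layer = name.split("H")[0]
        per_layer[layer] = per_layer.get(layer, 0.0) + value
    return per_layer


# Toy unembedding vectors and per-head outputs at the final position (made up).
unembed = {"Paris": [1.0, 0.0, 2.0], "Rome": [0.0, 1.0, 1.0]}
head_outputs = {
    "L0H0": [0.5, 0.0, 1.0],
    "L0H1": [0.0, 2.0, 0.0],
    "L1H0": [1.0, 1.0, 1.0],
}

direction = logit_diff_direction(unembed, "Paris", "Rome")
per_head = attribute(head_outputs, direction)
per_layer = group_by_layer(per_head)

# The residual stream here is just the sum of the head outputs.
total_resid = [sum(column) for column in zip(*head_outputs.values())]
total_logit_diff = dot(total_resid, direction)
```

Because the projection is linear, the same grouping step could in principle be repeated at finer levels (source layer, source component) once the outputs are split that way; how to do that split is not specified in this thread.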

Hope that makes sense and if you are unsure about how to abstract more I'm happy to have a chat about it!

alan-cooney avatar Jan 17 '24 22:01 alan-cooney

Hi @alan-cooney, thanks for the comment. I have a couple of questions:

I can get the attention head contributions (or even the MLP neurons) with get_full_resid_decomposition(), but I can only get the correct and incorrect directions for the residual stream, with tokens_to_residual_directions(). How can I get the directions for the individual heads (or even neurons)?

Also, what do you mean by breaking it down by 'source layer' and 'source component'?

VasilGeorgiev39 avatar Feb 05 '24 03:02 VasilGeorgiev39

@VasilGeorgiev39 Are you still available to wrap this up?

bryce13950 avatar Apr 27 '24 16:04 bryce13950

@bryce13950 Yes, I will be available after the 9th of May. What do you think would be the best approach for this?

VasilGeorgiev39 avatar May 01 '24 10:05 VasilGeorgiev39

I am not quite sure. Alan has been pulled away by his full-time job for the last few months. I have reached out to him separately on Slack to see if he can clarify his comments on this, but I haven't heard back. I don't really get what he means by source layer and source component either. Maybe we can start by turning it into its own module, and then see where it can be generalized. I do like your idea of setting it up as a tool, and I am likely going to be doing just that in another context. Do you want to move this into its own module in a directory named tools?

bryce13950 avatar May 02 '24 23:05 bryce13950