Neel Nanda

Results: 35 comments by Neel Nanda

I believe that Pythia 70m can have attention scores as low as -100,000, which will get you NaNs in float16, since float16 can only represent magnitudes up to about 65,504. Honestly, my take is...
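To illustrate the failure mode (a minimal sketch, not Pythia itself):

```python
import torch

# float16 can only represent magnitudes up to ~65,504, so a pre-softmax
# attention score of -100,000 overflows to -inf when cast to float16.
scores = torch.tensor([[-100_000.0, -100_000.0, -100_000.0]])
scores_fp16 = scores.to(torch.float16)
print(scores_fp16)  # tensor([[-inf, -inf, -inf]], dtype=torch.float16)

# Once an entire row of scores is -inf, the softmax is NaN in any precision
# (it computes exp(-inf - (-inf)) = exp(nan)), which is where the NaNs come from.
print(scores_fp16.float().softmax(dim=-1))  # tensor([[nan, nan, nan]])

# The same scores are fine in float32, whose max is ~3.4e38.
print(scores.softmax(dim=-1))  # tensor([[0.3333, 0.3333, 0.3333]])
```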

I've also observed this - my weak guess is that it's due to implementation details like the use of einsum and reshaping of attention matrices? I'm not super sure otherwise...

Interesting! Note that Pythia uses rotary attention, where b_K does matter (the key gets rotated by the difference in positions, so it doesn't cancel out between different source tokens)...
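To spell that out (my sketch, slightly simplified; Pythia only applies the rotation to a fraction of each head's dimensions): without rotary, the score from query position $i$ to source position $j$ is

$$s_{ij} = q_i^\top (W_K x_j + b_K) = q_i^\top W_K x_j + q_i^\top b_K,$$

and the $q_i^\top b_K$ term is the same for every $j$, so it cancels in the softmax over source positions. With rotary, queries and keys are each rotated by their own position, and $R(i)^\top R(j) = R(j - i)$, so

$$s_{ij} = (R(i)\,q_i)^\top R(j)\,(W_K x_j + b_K) = q_i^\top R(j - i)\,W_K x_j + q_i^\top R(j - i)\,b_K.$$

The bias term now depends on $j$, so it no longer cancels across source tokens.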

Bonus for doing this in a way that can be automatically updated as more models are added, but this is less important.

This function will [automatically get you the config](https://github.com/neelnanda-io/TransformerLens/blob/0ffcc8ad647d9e991f4c2596557a9d7475617773/transformer_lens/loading_from_pretrained.py#L659). I just used it to generate that janky table; I agree it could be much better! We can create a script...
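Something like this rough sketch is what I have in mind, assuming the helpers in loading_from_pretrained.py (`OFFICIAL_MODEL_NAMES` and `get_pretrained_model_config`); field names may drift between versions, and it hits the HuggingFace hub once per model:

```python
# Rough sketch of a docs-generation script; not tested against every model
# (gated models like LLaMA may need extra setup or local weights).
from transformer_lens.loading_from_pretrained import (
    OFFICIAL_MODEL_NAMES,
    get_pretrained_model_config,
)

rows = [
    "| model | n_layers | d_model | n_heads | n_ctx |",
    "| --- | --- | --- | --- | --- |",
]
for name in OFFICIAL_MODEL_NAMES:
    cfg = get_pretrained_model_config(name)
    rows.append(
        f"| {name} | {cfg.n_layers} | {cfg.d_model} | {cfg.n_heads} | {cfg.n_ctx} |"
    )

# Paste the output into the docs; rerun whenever new models are added.
print("\n".join(rows))
```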

You haven't folded the LayerNorm weights, so you need to do `(final_residual_post_ln * model.ln_final.w + model.ln_final.b) @ model.W_U + model.b_U`. I don't recall if LLaMA has LayerNorm weight folding implemented, but...
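As a concrete sketch of what I mean (assuming the standard `ln_final.hook_normalized` hook, which caches the residual after normalisation but before the learned scale and bias; using GPT-2 small as a stand-in):

```python
import torch
from transformer_lens import HookedTransformer

# Load without folding LayerNorm, so ln_final.w / ln_final.b stay as separate
# parameters instead of being folded into W_U.
model = HookedTransformer.from_pretrained("gpt2", fold_ln=False)

tokens = model.to_tokens("Hello world")
logits, cache = model.run_with_cache(tokens)

# Residual stream after ln_final's normalisation (mean-centred and rescaled),
# but before the elementwise affine transform.
final_residual_post_ln = cache["ln_final.hook_normalized"]

manual_logits = (
    final_residual_post_ln * model.ln_final.w + model.ln_final.b
) @ model.W_U + model.b_U

# Should match the model's own logits up to floating point error.
print((manual_logits - logits).abs().max())
```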

Arthur added a PR that changes the model's internal implementation so that attention layers have separate inputs for query, key and value. The testing code uses a particular layer...

Note that Callum McDougall has an updated version of the tutorial here: https://transformerlens-intro.streamlit.app/Transformer_from_scratch

Ah, no, LayerNorm should not be folded at all. You cannot fold it into W_O, because that would change the norm of the output of the layer and thus the...
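For context, the usual folding trick, as a sketch in my own notation (not specific to any one model): write $\mathrm{LN}(x) = \hat{x} \odot w_{\mathrm{ln}} + b_{\mathrm{ln}}$, where $\hat{x}$ is the normalised input. The scale can be absorbed into any matrix that reads the LN output, e.g. for the query weights

$$(\hat{x} \odot w_{\mathrm{ln}} + b_{\mathrm{ln}})\,W_Q = \hat{x}\,\bigl(\mathrm{diag}(w_{\mathrm{ln}})\,W_Q\bigr) + b_{\mathrm{ln}} W_Q,$$

whereas $W_O$ multiplies what the layer writes back into the residual stream, so rescaling it changes the layer's output rather than compensating for the normalisation of its input.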

Can you say more about what this would look like? I'm quite confused by the proposal. TransformerLens doesn't even have innate support for path patching, so I don't see what...