Neel Nanda

Results: 35 comments by Neel Nanda

I believe that Pythia 70m can have attention scores as low as -100,000, which will get you NaNs in float16, since float16 can only represent magnitudes up to about 65,504. Honestly, my take is...
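To illustrate the failure mode (a minimal sketch, not Pythia itself):

```python
import torch

# float16 can only represent magnitudes up to ~65,504, so a pre-softmax
# attention score of -100,000 overflows to -inf when cast to float16.
scores = torch.tensor([[-100_000.0, -100_000.0, -100_000.0]])
scores_fp16 = scores.to(torch.float16)
print(scores_fp16)  # tensor([[-inf, -inf, -inf]], dtype=torch.float16)

# Once an entire row of scores is -inf, the softmax is NaN in any precision
# (it computes exp(-inf - (-inf)) = exp(nan)), which is where the NaNs come from.
print(scores_fp16.float().softmax(dim=-1))  # tensor([[nan, nan, nan]])

# The same scores are fine in float32, whose max is ~3.4e38.
print(scores.softmax(dim=-1))  # tensor([[0.3333, 0.3333, 0.3333]])
```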

I've also observed this - my weak guess is that it's due to implementation details like the use of einsum and reshaping of attention matrices? I'm not super sure otherwise...

Interesting! Note that Pythia uses rotary attention, where b_K does matter (the key gets rotated by the difference in positions, so it doesn't cancel out between different source tokens)...
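To spell that out (my sketch, slightly simplified; Pythia only applies the rotation to a fraction of each head's dimensions): without rotary, the score from query position $i$ to source position $j$ is

$$s_{ij} = q_i^\top (W_K x_j + b_K) = q_i^\top W_K x_j + q_i^\top b_K,$$

and the $q_i^\top b_K$ term is the same for every $j$, so it cancels in the softmax over source positions. With rotary, queries and keys are each rotated by their own position, and $R(i)^\top R(j) = R(j - i)$, so

$$s_{ij} = (R(i)\,q_i)^\top R(j)\,(W_K x_j + b_K) = q_i^\top R(j - i)\,W_K x_j + q_i^\top R(j - i)\,b_K.$$

The bias term now depends on $j$, so it no longer cancels across source tokens.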

Bonus for doing this in a way that can be automatically updated as more models are added, but this is less important.

This function will [automatically get you the config](https://github.com/neelnanda-io/TransformerLens/blob/0ffcc8ad647d9e991f4c2596557a9d7475617773/transformer_lens/loading_from_pretrained.py#L659). I just used it to generate that janky table; I agree it could be much better! We can create a script...
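Something like this rough sketch is what I have in mind, assuming the helpers in loading_from_pretrained.py (`OFFICIAL_MODEL_NAMES` and `get_pretrained_model_config`); field names may drift between versions, and it hits the HuggingFace hub once per model:

```python
# Rough sketch of a docs-generation script; not tested against every model
# (gated models like LLaMA may need extra setup or local weights).
from transformer_lens.loading_from_pretrained import (
    OFFICIAL_MODEL_NAMES,
    get_pretrained_model_config,
)

rows = [
    "| model | n_layers | d_model | n_heads | n_ctx |",
    "| --- | --- | --- | --- | --- |",
]
for name in OFFICIAL_MODEL_NAMES:
    cfg = get_pretrained_model_config(name)
    rows.append(
        f"| {name} | {cfg.n_layers} | {cfg.d_model} | {cfg.n_heads} | {cfg.n_ctx} |"
    )

# Paste the output into the docs; rerun whenever new models are added.
print("\n".join(rows))
```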

You haven't folded the LayerNorm weights, so you need to do `(final_residual_post_ln * model.ln_final.w + model.ln_final.b) @ model.W_U + model.b_U`. I don't recall if LLaMA has LayerNorm weight folding implemented, but...
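As a concrete sketch of what I mean (assuming the standard `ln_final.hook_normalized` hook, which caches the residual after normalisation but before the learned scale and bias; using GPT-2 small as a stand-in):

```python
import torch
from transformer_lens import HookedTransformer

# Load without folding LayerNorm, so ln_final.w / ln_final.b stay as separate
# parameters instead of being folded into W_U.
model = HookedTransformer.from_pretrained("gpt2", fold_ln=False)

tokens = model.to_tokens("Hello world")
logits, cache = model.run_with_cache(tokens)

# Residual stream after ln_final's normalisation (mean-centred and rescaled),
# but before the elementwise affine transform.
final_residual_post_ln = cache["ln_final.hook_normalized"]

manual_logits = (
    final_residual_post_ln * model.ln_final.w + model.ln_final.b
) @ model.W_U + model.b_U

# Should match the model's own logits up to floating point error.
print((manual_logits - logits).abs().max())
```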

Arthur added a PR that changes the model's internal implementation so that attention layers have separate inputs for query, key and value. The testing code uses a particular layer...

Note that Callum McDougall has an updated version of the tutorial here: https://transformerlens-intro.streamlit.app/Transformer_from_scratch

Ah, no, LayerNorm should not be folded at all. You cannot fold it into W_O, because that would change the norm of the output of the layer and thus the...
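For context, the usual folding trick, as a sketch in my own notation (not specific to any one model): write $\mathrm{LN}(x) = \hat{x} \odot w_{\mathrm{ln}} + b_{\mathrm{ln}}$, where $\hat{x}$ is the normalised input. The scale can be absorbed into any matrix that reads the LN output, e.g. for the query weights

$$(\hat{x} \odot w_{\mathrm{ln}} + b_{\mathrm{ln}})\,W_Q = \hat{x}\,\bigl(\mathrm{diag}(w_{\mathrm{ln}})\,W_Q\bigr) + b_{\mathrm{ln}} W_Q,$$

whereas $W_O$ multiplies what the layer writes back into the residual stream, so rescaling it changes the layer's output rather than compensating for the normalisation of its input.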

Can you say more about what this would look like? I'm quite confused by the proposal. TransformerLens doesn't even have innate support for path patching, so I don't see what...