Andy Arditi
# Description

Previously, we were allocating causal masks of size `(n_ctx, n_ctx)` for every instantiation of `AbstractAttention`, where `n_ctx` corresponds to the _maximum_ context length. For models with a large...
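To make the cost concrete, here is a back-of-the-envelope sketch (not TransformerLens code; the `n_ctx` and `n_layers` values are illustrative assumptions, not taken from the issue) of how much memory full `(n_ctx, n_ctx)` boolean masks consume when one is allocated per attention module:

```python
import torch

# Illustrative values (assumptions): a model with a maximum context
# length of 2048 and one attention module per layer.
n_ctx = 2048
n_layers = 32

# A full lower-triangular causal mask, one per attention module.
mask = torch.tril(torch.ones(n_ctx, n_ctx)).bool()

per_mask_bytes = mask.numel() * mask.element_size()  # 1 byte per bool
print(f"one mask:   {per_mask_bytes / 2**20:.0f} MiB")             # 4 MiB
print(f"all layers: {n_layers * per_mask_bytes / 2**20:.0f} MiB")  # 128 MiB
```

The allocation grows quadratically in `n_ctx`, which is why models with large maximum context lengths are the ones that hurt.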
### Proposal

[Relatively minor proposal; considered making it a bug, but it's not *really* a bug.]

In the initialization of each `Attention` module, we [register](https://github.com/neelnanda-io/TransformerLens/blob/ce82675a8e89b6d5e6229a89620c843c794f3b04/transformer_lens/components.py#L440C9-L440C20) a `causal_mask` buffer. This...
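The proposal text is cut off above, but the pattern it describes can be sketched as follows. This is a minimal, hypothetical reconstruction rather than the actual `AbstractAttention` implementation: the buffer is registered eagerly at the full `n_ctx` size, while forward passes only ever read a `(seq_len, seq_len)` corner of it.

```python
import torch
import torch.nn as nn


class Attention(nn.Module):
    """Minimal sketch of the pattern described above; not the actual
    TransformerLens implementation."""

    def __init__(self, n_ctx: int):
        super().__init__()
        # Eagerly allocates an (n_ctx, n_ctx) bool buffer, where n_ctx is
        # the model's *maximum* context length -- even though forward
        # passes may only ever see much shorter sequences.
        causal_mask = torch.tril(torch.ones(n_ctx, n_ctx)).bool()
        self.register_buffer("causal_mask", causal_mask)

    def apply_causal_mask(self, attn_scores: torch.Tensor) -> torch.Tensor:
        # At runtime only the top-left (seq_len, seq_len) corner of the
        # buffer is ever read.
        seq_len = attn_scores.size(-1)
        return attn_scores.masked_fill(
            ~self.causal_mask[:seq_len, :seq_len], float("-inf")
        )
```

One fix consistent with the description (an assumption on my part, since the proposal is truncated) would be to build the mask on the fly at the actual sequence length instead of registering it at `n_ctx` size:

```python
def lazy_causal_mask(seq_len: int, device: torch.device) -> torch.Tensor:
    # Hypothetical alternative: a (seq_len, seq_len) mask built per forward
    # pass, avoiding a persistent (n_ctx, n_ctx) buffer on the module.
    return torch.tril(torch.ones(seq_len, seq_len, device=device)).bool()
```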