Alex Loftus
also preface
@ebridge2 This can probably be closed? I think we are having the copy-editor do this
@dirkgr Here's a pretty basic check on this: I got the activations in every layer for a single prompt, then averaged over the batch and hidden dimensions to get the average...
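A minimal sketch of that check, assuming a Hugging Face `transformers` model; the thread doesn't name the checkpoint, so TinyLlama stands in here purely as a LLaMA-style model:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Stand-in checkpoint (assumption): any LLaMA-style model with
# `self_attn` submodules would work the same way here.
name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)
model.eval()

inputs = tokenizer("a single prompt", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states is a tuple of (n_layers + 1) tensors, each of shape
# (batch, seq, hidden). Average over the batch and hidden dimensions,
# giving one per-position mean activation per layer.
layer_means = [h.mean(dim=(0, 2)) for h in out.hidden_states]
```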
I also plotted pairwise $\lambda$ values, where $\lambda = \frac{1}{n_{\text{layers}}} \log \frac{\|v'\|}{\|v\|}$, $v'$ is the activation at layer $i$, and $v$ is the activation at layer $i+1$, as well as the case where $v'$ is the...
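A sketch of that pairwise $\lambda$ computation, reusing `out.hidden_states` from the snippet above (again an assumption about the setup, not the code actually used):

```python
# For each adjacent pair of layers, compute
# lambda = (1 / n_layers) * log(||v'|| / ||v||),
# with v' the activation at layer i and v the activation at layer i + 1.
hs = out.hidden_states
n_layers = len(hs) - 1
lambdas = [
    (torch.log(hs[i].norm() / hs[i + 1].norm()) / n_layers).item()
    for i in range(n_layers)
]
```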
> Why does the mean of the activations start below zero? The mean of the weights of every module in `self_attn` in the first layer is (slightly) negative, and there's...
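A quick way to check that claim, continuing with the same assumed model; `model.layers[0].self_attn` is LLaMA-style naming and may differ for the actual checkpoint:

```python
# Print the mean of each weight matrix in the first layer's
# `self_attn` module; a (slightly) negative mean supports the
# observation quoted above.
for pname, param in model.layers[0].self_attn.named_parameters():
    print(f"{pname}: mean = {param.data.mean().item():+.3e}")
```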
idk if we need this, dude. i'm strongly against scope creep at this point
@ebridge2 did you do this?
Isn't the goal of Mojo to be a drop-in replacement for Python? How could that be achieved if Mojo is deviating from Python function names? I'd expect that most users...