daejin

1 comment by daejin
I'm looking for clarification on why the `query_pre_attn_scalar` value was changed from 224 (`d_model` / `# heads`) to `head_dim` (256) specifically for the 9B model in the [latest commit](https://github.com/google/gemma_pytorch/commit/03e657582d17cb5a8617ebf333c1c16f3694670e), while...
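For context on what this config value controls: in gemma_pytorch, queries are multiplied by `query_pre_attn_scalar ** -0.5` before the `q @ k^T` product, in place of the usual `head_dim ** -0.5` softmax scaling. A minimal sketch of the arithmetic behind the question (the 9B config values `hidden_size=3584`, `num_heads=16`, `head_dim=256` are assumptions taken from the model card, not from this thread):

```python
import math

def attn_scale(query_pre_attn_scalar: int) -> float:
    # Factor applied to queries before q @ k^T,
    # replacing the conventional 1/sqrt(head_dim).
    return query_pre_attn_scalar ** -0.5

# Assumed 9B dimensions: head_dim (256) != hidden_size / num_heads (224),
# which is why the two candidate scalars differ for this model.
hidden_size, num_heads, head_dim = 3584, 16, 256
old_scalar = hidden_size // num_heads   # 224
new_scalar = head_dim                   # 256

print(old_scalar, new_scalar)                          # 224 256
print(attn_scale(old_scalar) > attn_scale(new_scalar))  # True: the new value scales queries down slightly more
```

The distinction only matters for models where `head_dim * num_heads != hidden_size`; when the two quantities coincide, both choices give the same scale.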