Enrico Shippole
@lucidrains I am using the current CosineSimCausalTransformer available in the repository for the GPT-2 run. I believe the architecture used post-norm layers with DeepNorm. I did not see a specific...
@lucidrains If I missed something, or if you would like me to add a PreNorm to the Attention layer, I am more than willing to test with that as well.
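For reference, here is a minimal sketch of the two residual arrangements being discussed, assuming a DeepNorm-style post-norm block (LayerNorm applied after the residual sum, with a depth-dependent constant alpha) versus wrapping the attention sublayer in a PreNorm. The module names and the sublayer callable are illustrative, not the repository's actual classes:

```python
import torch
from torch import nn

class PostNormDeepNorm(nn.Module):
    # DeepNorm-style post-norm: LayerNorm(alpha * x + sublayer(x)),
    # where alpha is a constant chosen from the model depth
    def __init__(self, dim, sublayer, alpha = 1.0):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.sublayer = sublayer
        self.alpha = alpha

    def forward(self, x):
        return self.norm(self.alpha * x + self.sublayer(x))

class PreNorm(nn.Module):
    # Pre-norm: normalize the input before the sublayer, then add the residual
    def __init__(self, dim, sublayer):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.sublayer = sublayer

    def forward(self, x):
        return x + self.sublayer(self.norm(x))
```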
@lucidrains Here are the results for fp16 training **without** pre-layernorm for 30k steps on an A100 (40GB). The recent update greatly improved numerical stability for fp16 training. Training loss: (plot attached). Validation loss (validating every 10 steps): (plot attached). fp16 training **with** pre-layernorm (validating every 10 steps): ...
Results for training standard PaLM on an A100 (40GB) for 30k steps:
- 2.15 it/s
- Sequence length 1024
- fp32
@lucidrains Here is the code for the PaLM model with flash cosine sim attention. The model is currently training and I will likely update the results later tonight or tomorrow...
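As a rough illustration of what cosine sim attention computes inside that model, here is a minimal plain-PyTorch sketch, not the fused flash kernel from the repository; the fixed `scale` value and the head layout are assumptions:

```python
import torch
import torch.nn.functional as F

def cosine_sim_attention(q, k, v, scale = 10.0, causal = True):
    # q, k, v: (batch, heads, seq_len, dim_head)
    # l2-normalize queries and keys so their dot product is a cosine similarity,
    # then multiply by a fixed temperature instead of the usual 1/sqrt(dim_head)
    q, k = map(lambda t: F.normalize(t, dim = -1), (q, k))
    sim = (q @ k.transpose(-2, -1)) * scale

    if causal:
        i, j = sim.shape[-2:]
        mask = torch.ones(i, j, dtype = torch.bool, device = q.device).triu(j - i + 1)
        sim = sim.masked_fill(mask, float('-inf'))

    attn = sim.softmax(dim = -1)
    return attn @ v
```

Because the similarities are bounded by the cosine, this formulation is one reason the fp16 runs above stay numerically stable.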
@lucidrains I will make the adjustments to the model to apply the l2 normalization before the rotation of the queries and keys, and post the results. Thank you, Enrico
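A sketch of that ordering, assuming a standard rotary embedding helper; the only point it shows is that `F.normalize` runs on the queries and keys before the rotation is applied:

```python
import torch
import torch.nn.functional as F

def rotate_half(x):
    # split the last dimension in two halves and rotate, as in standard rotary embeddings
    x1, x2 = x.chunk(2, dim = -1)
    return torch.cat((-x2, x1), dim = -1)

def apply_rotary(x, freqs):
    # freqs: (seq_len, dim_head) of angles, broadcast over batch and heads
    return x * freqs.cos() + rotate_half(x) * freqs.sin()

def prepare_qk(q, k, freqs):
    # l2-normalize first, then rotate - the ordering being tested here
    q, k = map(lambda t: F.normalize(t, dim = -1), (q, k))
    return apply_rotary(q, freqs), apply_rotary(k, freqs)
```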