Enrico Shippole
@lucidrains I am using the current CosineSimCausalTransformer available in the repository for the GPT-2 run. I believe the architecture used post-norm layers with DeepNorm. I did not see a specific...
@lucidrains If I missed something, or if you would like me to add a PreNorm to the Attention layer, I am more than willing to test with that as well.
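For reference, here is a minimal sketch of the two residual arrangements being discussed, assuming a DeepNorm-style post-norm block (LayerNorm applied after the residual sum, with a depth-dependent constant alpha) versus wrapping the attention sublayer in a PreNorm. The module names and the sublayer callable are illustrative, not the repository's actual classes:

```python
import torch
from torch import nn

class PostNormDeepNorm(nn.Module):
    # DeepNorm-style post-norm: LayerNorm(alpha * x + sublayer(x)),
    # where alpha is a constant chosen from the model depth
    def __init__(self, dim, sublayer, alpha = 1.0):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.sublayer = sublayer
        self.alpha = alpha

    def forward(self, x):
        return self.norm(self.alpha * x + self.sublayer(x))

class PreNorm(nn.Module):
    # Pre-norm: normalize the input before the sublayer, then add the residual
    def __init__(self, dim, sublayer):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.sublayer = sublayer

    def forward(self, x):
        return x + self.sublayer(self.norm(x))
```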
@lucidrains Here are the results for fp16 training **without** pre-layernorm for 30k steps on an A100 (40GB). The recent update greatly improved numerical stability for fp16 training. Training loss: (plot attached). Validation loss (validating every 10 steps): (plot attached). fp16 training **with** pre-layernorm (validating every 10 steps): ...
Results for training standard PaLM on an A100 (40GB) for 30k steps:
- 2.15 it/s
- Sequence length 1024
- fp32
@lucidrains Here is the code for the PaLM model with flash cosine sim attention. The model is currently training and I will likely update the results later tonight or tomorrow...
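As a rough illustration of what cosine sim attention computes inside that model, here is a minimal plain-PyTorch sketch, not the fused flash kernel from the repository; the fixed `scale` value and the head layout are assumptions:

```python
import torch
import torch.nn.functional as F

def cosine_sim_attention(q, k, v, scale = 10.0, causal = True):
    # q, k, v: (batch, heads, seq_len, dim_head)
    # l2-normalize queries and keys so their dot product is a cosine similarity,
    # then multiply by a fixed temperature instead of the usual 1/sqrt(dim_head)
    q, k = map(lambda t: F.normalize(t, dim = -1), (q, k))
    sim = (q @ k.transpose(-2, -1)) * scale

    if causal:
        i, j = sim.shape[-2:]
        mask = torch.ones(i, j, dtype = torch.bool, device = q.device).triu(j - i + 1)
        sim = sim.masked_fill(mask, float('-inf'))

    attn = sim.softmax(dim = -1)
    return attn @ v
```

Because the similarities are bounded by the cosine, this formulation is one reason the fp16 runs above stay numerically stable.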
@lucidrains I will make the adjustments to the model to apply the l2 normalization before the rotation of the queries and keys, and post the results. Thank you, Enrico
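A sketch of that ordering, assuming a standard rotary embedding helper; the only point it shows is that `F.normalize` runs on the queries and keys before the rotation is applied:

```python
import torch
import torch.nn.functional as F

def rotate_half(x):
    # split the last dimension in two halves and rotate, as in standard rotary embeddings
    x1, x2 = x.chunk(2, dim = -1)
    return torch.cat((-x2, x1), dim = -1)

def apply_rotary(x, freqs):
    # freqs: (seq_len, dim_head) of angles, broadcast over batch and heads
    return x * freqs.cos() + rotate_half(x) * freqs.sin()

def prepare_qk(q, k, freqs):
    # l2-normalize first, then rotate - the ordering being tested here
    q, k = map(lambda t: F.normalize(t, dim = -1), (q, k))
    return apply_rotary(q, freqs), apply_rotary(k, freqs)
```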