Andrei-Aksionov
I am using pyinstrument for profiling some ML/NLU-specific tasks and noticed that, when running profiling on a function with a lot of simple computations inside, memory consumption steadily increases. For...
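A minimal sketch of the kind of setup described above (illustrative only, not the original benchmark): profiling a function that performs many small, cheap operations.

```python
from pyinstrument import Profiler

def many_small_ops(n: int = 1_000_000) -> int:
    # Lots of trivial computations, so the profiler collects many samples.
    total = 0
    for i in range(n):
        total += i * i
    return total

profiler = Profiler()
profiler.start()
many_small_ops()
profiler.stop()
print(profiler.output_text(unicode=True, color=False))
```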
Instead of indexing positional embeddings we can slice them. It has a couple of benefits:
1. Looks cleaner.
2. When indexing, a new tensor is returned (plus a new tensor each...
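A minimal sketch of the two approaches, assuming a nanoGPT-style learned positional embedding table `wpe` of shape `(block_size, n_embd)`; the names are illustrative.

```python
import torch
import torch.nn as nn

block_size, n_embd, T = 1024, 768, 16
wpe = nn.Embedding(block_size, n_embd)

# Indexing: builds an index tensor and gathers rows into a brand-new tensor.
pos = torch.arange(0, T, dtype=torch.long)
pos_emb_indexed = wpe(pos)

# Slicing: a view into the existing weight, no extra allocation.
pos_emb_sliced = wpe.weight[:T]

assert torch.equal(pos_emb_indexed, pos_emb_sliced)
```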
I want to talk about loss calculation in the forward method:
```python
else:
    B, T, C = logits.shape
    logits = logits.view(B*T, C)
    targets = targets.view(B*T)
    loss = F.cross_entropy(logits, targets)
return ...
```
During sampling, when the model [is loaded](https://github.com/karpathy/nanoGPT/blob/master/sample.py#L49) from pretrained GPT-2 weights, the dropout value is set to 0.0. I assume that was done because, previously, with PyTorch 2.0 flash attention...
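For context, a minimal sketch of the loading step in question, assuming nanoGPT's `GPT.from_pretrained(model_type, override_args)` signature; the exact call in `sample.py` may differ slightly.

```python
from model import GPT  # model.py from the nanoGPT repository

# Dropout is only active in training mode, so 0.0 is the natural choice for sampling.
model = GPT.from_pretrained('gpt2', override_args=dict(dropout=0.0))
model.eval()
```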
*Accidentally messed up the PR and the branch, so let's try one more time.* I really don't like making such somewhat big PRs, but I don't want to bombard with...
Previously, if a value in the train.py file had a default of None (or you tried to override it with None), the configurator failed to override this value with a new one because of...
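The exact cause is cut off above, so here is only a rough sketch, assuming a nanoGPT-style `configurator.py` that type-checks every override against the current global value: with a `None` default, the comparison is against `NoneType` and any real override gets rejected.

```python
from ast import literal_eval

globals_like = {"init_from": None}     # a train.py-style default of None
key, val = "init_from", "'resume'"     # override coming from --init_from='resume'

try:
    attempt = literal_eval(val)
except (SyntaxError, ValueError):
    attempt = val

# A type check like this compares against type(None) and therefore
# rejects any non-None override of a None default.
if type(attempt) == type(globals_like[key]):
    globals_like[key] = attempt
else:
    print(f"override for {key!r} rejected: {type(attempt)} != {type(globals_like[key])}")
```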
This is an experiment (perhaps it needs to be a `Draft`?) to apply LoRA not only to the query and value matrices, but to:
- query
- key
- value...
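For context, a generic LoRA sketch (illustrative only, not the code from this PR): a frozen linear projection plus a trainable low-rank update, the kind of adapter that can be attached to the query/key/value (and other) matrices.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: int = 16):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features, bias=False)
        self.linear.weight.requires_grad_(False)           # frozen pretrained weight
        self.lora_A = nn.Parameter(torch.zeros(r, in_features))
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))
        nn.init.normal_(self.lora_A, std=0.02)             # B stays zero, so the update starts at 0
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

# Usage: drop-in replacement for a projection matrix.
layer = LoRALinear(768, 768)
y = layer(torch.randn(2, 16, 768))
```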
### Description & Motivation
Hi there 👋 As @carmocca proposed, I would like to add functionality for the BNB precision plugin to dequantize weights. Why is it needed? If anyone wants to...
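As a rough illustration of what "dequantize weights" could look like (an assumption on my side: bitsandbytes' `Linear4bit` keeps a `Params4bit` weight with a `quant_state`, and `bnb.functional.dequantize_4bit` is available; exact names vary between versions):

```python
import torch
import bitsandbytes as bnb

def dequantize_linear4bit_weight(layer: "bnb.nn.Linear4bit") -> torch.Tensor:
    # Reconstruct a dense, higher-precision weight tensor from the packed 4-bit storage.
    return bnb.functional.dequantize_4bit(layer.weight.data, layer.weight.quant_state)
```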
I think this book should also contain an analogy where we need to descend from the top of a hill down into the valley. The only difference between what I usually...
Hello there 👋 Thanks for the repo. But I have one question: why do we need to scale up (normalize) token embeddings? https://github.com/google/gemma_pytorch/blob/01062c9ef4cf89ac0c985b25a734164ede017d0b/gemma/model.py#L431-L432 Unfortunately, I cannot find an answer anywhere.
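For reference, the scaling in question (as I read the linked lines; a minimal sketch, not the repository's exact code) multiplies the token embeddings by sqrt(hidden_size) before they enter the transformer blocks:

```python
import torch

hidden_size = 2048                                  # illustrative value
hidden_states = torch.randn(1, 16, hidden_size)     # (batch, seq_len, hidden_size)

normalizer = torch.tensor(hidden_size ** 0.5, dtype=hidden_states.dtype)
hidden_states = hidden_states * normalizer
```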