Andrei-Aksionov
I am using pyinstrument for profiling some ML/NLU-specific tasks and noticed that, when running profiling on a function with a lot of simple computations inside, memory consumption steadily increases. For...
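A minimal sketch of the kind of setup described above (illustrative only, not the original benchmark): profiling a function that performs many small, cheap operations.

```python
from pyinstrument import Profiler

def many_small_ops(n: int = 1_000_000) -> int:
    # Lots of trivial computations, so the profiler collects many samples.
    total = 0
    for i in range(n):
        total += i * i
    return total

profiler = Profiler()
profiler.start()
many_small_ops()
profiler.stop()
print(profiler.output_text(unicode=True, color=False))
```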
Instead of indexing positional embeddings we can slice them. It has a couple of benefits:
1. Looks cleaner.
2. When indexing, a new tensor is returned (plus a new tensor each...
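A minimal sketch of the two approaches, assuming a nanoGPT-style learned positional embedding table `wpe` of shape `(block_size, n_embd)`; the names are illustrative.

```python
import torch
import torch.nn as nn

block_size, n_embd, T = 1024, 768, 16
wpe = nn.Embedding(block_size, n_embd)

# Indexing: builds an index tensor and gathers rows into a brand-new tensor.
pos = torch.arange(0, T, dtype=torch.long)
pos_emb_indexed = wpe(pos)

# Slicing: a view into the existing weight, no extra allocation.
pos_emb_sliced = wpe.weight[:T]

assert torch.equal(pos_emb_indexed, pos_emb_sliced)
```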
I want to talk about loss calculation in the forward method:
```python
else:
    B, T, C = logits.shape
    logits = logits.view(B*T, C)
    targets = targets.view(B*T)
    loss = F.cross_entropy(logits, targets)
return ...
```
During sampling, when the model [is loaded](https://github.com/karpathy/nanoGPT/blob/master/sample.py#L49) from pretrained GPT-2 weights, the dropout value is set to 0.0. I assume that was done because, previously, with PyTorch 2.0 flash attention...
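For context, a minimal sketch of the loading step in question, assuming nanoGPT's `GPT.from_pretrained(model_type, override_args)` signature; the exact call in `sample.py` may differ slightly.

```python
from model import GPT  # model.py from the nanoGPT repository

# Dropout is only active in training mode, so 0.0 is the natural choice for sampling.
model = GPT.from_pretrained('gpt2', override_args=dict(dropout=0.0))
model.eval()
```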
*Accidentally messed up the PR and the branch, so let's try one more time.* I really don't like making such somewhat big PRs, but I don't want to bombard with...
Previously, if a value in the train.py file had a default of None (or you tried to override it with None), the configurator failed to override this value with a new one because of...
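The exact cause is cut off above, so here is only a rough sketch, assuming a nanoGPT-style `configurator.py` that type-checks every override against the current global value: with a `None` default, the comparison is against `NoneType` and any real override gets rejected.

```python
from ast import literal_eval

globals_like = {"init_from": None}     # a train.py-style default of None
key, val = "init_from", "'resume'"     # override coming from --init_from='resume'

try:
    attempt = literal_eval(val)
except (SyntaxError, ValueError):
    attempt = val

# A type check like this compares against type(None) and therefore
# rejects any non-None override of a None default.
if type(attempt) == type(globals_like[key]):
    globals_like[key] = attempt
else:
    print(f"override for {key!r} rejected: {type(attempt)} != {type(globals_like[key])}")
```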
This is an experiment (perhaps it needs to be a `Draft`?) to apply LoRA not only to the query and value matrices, but to:
- query
- key
- value...
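For context, a generic LoRA sketch (illustrative only, not the code from this PR): a frozen linear projection plus a trainable low-rank update, the kind of adapter that can be attached to the query/key/value (and other) matrices.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: int = 16):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features, bias=False)
        self.linear.weight.requires_grad_(False)           # frozen pretrained weight
        self.lora_A = nn.Parameter(torch.zeros(r, in_features))
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))
        nn.init.normal_(self.lora_A, std=0.02)             # B stays zero, so the update starts at 0
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

# Usage: drop-in replacement for a projection matrix.
layer = LoRALinear(768, 768)
y = layer(torch.randn(2, 16, 768))
```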
### Description & Motivation
Hi there 👋 As @carmocca proposed, I would like to add functionality for the BNB precision plugin to dequantize weights. Why is it needed? If anyone wants to...
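As a rough illustration of what "dequantize weights" could look like (an assumption on my side: bitsandbytes' `Linear4bit` keeps a `Params4bit` weight with a `quant_state`, and `bnb.functional.dequantize_4bit` is available; exact names vary between versions):

```python
import torch
import bitsandbytes as bnb

def dequantize_linear4bit_weight(layer: "bnb.nn.Linear4bit") -> torch.Tensor:
    # Reconstruct a dense, higher-precision weight tensor from the packed 4-bit storage.
    return bnb.functional.dequantize_4bit(layer.weight.data, layer.weight.quant_state)
```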
I think this book should also contain an analogy where we need to descend from the top of a hill down into the valley. The only difference between what I usually...
Hello there 👋 Thanks for the repo. But I have one question: why do we need to scale up (normalize) token embeddings? https://github.com/google/gemma_pytorch/blob/01062c9ef4cf89ac0c985b25a734164ede017d0b/gemma/model.py#L431-L432 Unfortunately, I cannot find an answer anywhere.
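For reference, the scaling in question (as I read the linked lines; a minimal sketch, not the repository's exact code) multiplies the token embeddings by sqrt(hidden_size) before they enter the transformer blocks:

```python
import torch

hidden_size = 2048                                  # illustrative value
hidden_states = torch.randn(1, 16, hidden_size)     # (batch, seq_len, hidden_size)

normalizer = torch.tensor(hidden_size ** 0.5, dtype=hidden_states.dtype)
hidden_states = hidden_states * normalizer
```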