Tim Dettmers

Results: 106 comments by Tim Dettmers

> 1. I'd like to discuss whether we actually need LoRA adapters in the possible implementation. As I see it, they are not necessarily a part of the 8-bit model....
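
For context, a minimal sketch of the kind of setup under discussion: LoRA adapters attached on top of an 8-bit base model via transformers and peft. The adapters are a separate set of trainable weights rather than part of the quantized model itself; the model name and LoRA hyperparameters below are placeholders, not a recommendation.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Placeholder model; load_in_8bit=True uses LLM.int8() via bitsandbytes.
base = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",
    load_in_8bit=True,
    device_map="auto",
)

# Example LoRA configuration; the adapters live outside the int8 weights.
config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(base, config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```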

> Did you mean to say something different here, Tim? Unless I misunderstood, int8 is already a single data type. Currently, the bnb quantization by default uses dynamic block-wise quantization...
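
To illustrate the block-wise part of that idea (not bnb's actual dynamic data type, which maps values through a non-linear 8-bit code rather than a plain linear int8 grid), a simplified absmax sketch might look like this:

```python
import torch

def blockwise_absmax_quantize(x: torch.Tensor, block_size: int = 4096):
    """Simplified sketch: per-block absmax scaling to int8.

    bitsandbytes' dynamic block-wise quantization additionally uses a
    non-linear (dynamic) 8-bit data type; this shows only the block-wise part.
    """
    flat = x.flatten()
    pad = (-flat.numel()) % block_size
    if pad:
        flat = torch.cat([flat, flat.new_zeros(pad)])
    blocks = flat.view(-1, block_size)
    absmax = blocks.abs().amax(dim=1, keepdim=True).clamp(min=1e-8)
    codes = torch.round(blocks / absmax * 127).to(torch.int8)
    return codes, absmax

def blockwise_dequantize(codes, absmax, shape, numel):
    deq = codes.to(torch.float32) / 127 * absmax
    return deq.flatten()[:numel].view(shape)

x = torch.randn(10_000)
codes, absmax = blockwise_absmax_quantize(x, block_size=4096)
x_hat = blockwise_dequantize(codes, absmax, x.shape, x.numel())
print((x - x_hat).abs().max())  # small per-block quantization error
```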

> Strangely, int8 (LLM.int8() to be specific) for the 65B model works like a charm on an A100, but leads to bad results on a V100 with abnormally short generated sequences. I will...

Thank you, it is good to know that wgmma is now added. I think Hopper supports both sm_90 and sm_90a. Since we do not make use of wgmma or setmaxnreg...

Hi! Thanks for your questions. 1. The mask scheduler is different from the learning rate scheduler. The learning rate scheduler should be unaffected by the code. 2. That is correct....
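
A toy sketch of the separation described above (the pruning rule and schedule here are illustrative only, not the library's actual API): the learning-rate scheduler steps as usual, while the mask update runs on its own independent schedule.

```python
import torch

model = torch.nn.Linear(16, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
lr_scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10)

x, y = torch.randn(64, 16), torch.randn(64, 1)

for epoch in range(10):
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
    lr_scheduler.step()  # learning-rate schedule: unaffected by the sparse-training code
    if epoch % 2 == 0:   # mask schedule: runs independently, e.g. every other epoch
        with torch.no_grad():
            w = model.weight
            k = max(1, int(0.2 * w.numel()))
            threshold = w.abs().flatten().kthvalue(k).values
            w.mul_((w.abs() > threshold).float())  # zero out the smallest weights
```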

I had a look at this before. It is a bit more work and I will probably not focus on it for the next release, but it's good to know...

Thanks for this PR! I am currently preparing a major overhaul of these algorithms and interfaces. I have to check how to best integrate this PR.

The "missing" weight are weights that are 0. The gradient can still be calculated from those weights and the momentum is just the exponential mean of these gradients over time....

Good catch, I was not aware of this behavior. I did not change the code any further and trained on 4 GPUs. I have not studied the performance difference in...

I was quite interested in using sparse tensors myself, and I also worked on sparse GPU algorithms, but unfortunately, it is a very complicated issue, especially if you want to...