Tim Dettmers

Results: 106 comments by Tim Dettmers

> 1. I'd like to discuss whether we actually need LoRA adapters in the possible implementation. As I see it, they are not necessarily a part of the 8-bit model....
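
For context, a minimal sketch of the kind of setup under discussion: LoRA adapters attached on top of an 8-bit base model via transformers and peft. The adapters are a separate set of trainable weights rather than part of the quantized model itself; the model name and LoRA hyperparameters below are placeholders, not a recommendation.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Placeholder model; load_in_8bit=True uses LLM.int8() via bitsandbytes.
base = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",
    load_in_8bit=True,
    device_map="auto",
)

# Example LoRA configuration; the adapters live outside the int8 weights.
config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(base, config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```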

> Did you mean to say something different here, Tim? Unless I misunderstood, int8 is already a single data type. Currently, the bnb quantization by default uses dynamic block-wise quantization...
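
To illustrate the block-wise part of that idea (not bnb's actual dynamic data type, which maps values through a non-linear 8-bit code rather than a plain linear int8 grid), a simplified absmax sketch might look like this:

```python
import torch

def blockwise_absmax_quantize(x: torch.Tensor, block_size: int = 4096):
    """Simplified sketch: per-block absmax scaling to int8.

    bitsandbytes' dynamic block-wise quantization additionally uses a
    non-linear (dynamic) 8-bit data type; this shows only the block-wise part.
    """
    flat = x.flatten()
    pad = (-flat.numel()) % block_size
    if pad:
        flat = torch.cat([flat, flat.new_zeros(pad)])
    blocks = flat.view(-1, block_size)
    absmax = blocks.abs().amax(dim=1, keepdim=True).clamp(min=1e-8)
    codes = torch.round(blocks / absmax * 127).to(torch.int8)
    return codes, absmax

def blockwise_dequantize(codes, absmax, shape, numel):
    deq = codes.to(torch.float32) / 127 * absmax
    return deq.flatten()[:numel].view(shape)

x = torch.randn(10_000)
codes, absmax = blockwise_absmax_quantize(x, block_size=4096)
x_hat = blockwise_dequantize(codes, absmax, x.shape, x.numel())
print((x - x_hat).abs().max())  # small per-block quantization error
```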

> Strangely, int8 (LLM.int8() to be specific) for the 65B model works like a charm on an A100, but leads to bad results on a V100 with abnormally short generated sequences. I will...

Thank you, it is good to know that wgmma is now added. I think Hopper supports both sm_90 and sm_90a. Since we do not make use of wgmma or setmaxnreg...

Hi! Thanks for your questions. 1. The mask scheduler is different from the learning rate scheduler. The learning rate scheduler should be unaffected by the code. 2. That is correct....
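
A toy sketch of the separation described above (the pruning rule and schedule here are illustrative only, not the library's actual API): the learning-rate scheduler steps as usual, while the mask update runs on its own independent schedule.

```python
import torch

model = torch.nn.Linear(16, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
lr_scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10)

x, y = torch.randn(64, 16), torch.randn(64, 1)

for epoch in range(10):
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
    lr_scheduler.step()  # learning-rate schedule: unaffected by the sparse-training code
    if epoch % 2 == 0:   # mask schedule: runs independently, e.g. every other epoch
        with torch.no_grad():
            w = model.weight
            k = max(1, int(0.2 * w.numel()))
            threshold = w.abs().flatten().kthvalue(k).values
            w.mul_((w.abs() > threshold).float())  # zero out the smallest weights
```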

I had a look at this before. It is a bit more work and I will probably not focus on it for the next release, but it's good to know...

Thanks for this PR! I am currently preparing a major overhaul of these algorithms and interfaces. I have to check how to best integrate this PR.

The "missing" weight are weights that are 0. The gradient can still be calculated from those weights and the momentum is just the exponential mean of these gradients over time....

Good catch, I was not aware of this behavior. I did not change the code any further and trained on 4 GPUs. I have not studied the performance difference in...

I was quite interested in using sparse tensors myself, and I also worked on sparse GPU algorithms, but unfortunately, it is a very complicated issue, especially if you want to...