bitsandbytes
Fix for Pascal NaN redux
Force push over-rode but it isn't fixed.
I have tried this and it works. Please test it on your other cards.
All credit goes to: richardwth
@richardwth
fixes: https://github.com/TimDettmers/bitsandbytes/issues/165
Doesn't seem to work on my Tesla P40 card, unless the oobabooga webui has some other underlying issue as well?
How can I confirm whether the issue is the webui or bnb?
I am getting RuntimeError: expected scalar type Half but found Float while trying to use 8-bit mode.
I built your patch-1 repository by running
CUDA_HOME=~/local/cuda-11.8 CUDA_VERSION=118 make cuda11x
then
sudo python3 setup.py install
Never mind, it's a bnb issue. I found the 8bit_test.py from the issue you linked and get the same RuntimeError: probability tensor contains either inf, nan or element < 0
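For context, the RuntimeError about the probability tensor is raised at the sampling step (in PyTorch, by torch.multinomial) when the probabilities it receives contain invalid values, which is how NaNs produced earlier in the forward pass surface during generation. A minimal sketch of the same validity check in plain Python follows; the helper name has_invalid_probs is hypothetical and is not part of bitsandbytes or PyTorch.

```python
import math


def has_invalid_probs(probs):
    """Return True if any entry is NaN, infinite, or negative.

    Hypothetical helper: mirrors the three conditions named in the
    error message, applied to a plain list of floats.
    """
    return any(math.isnan(p) or math.isinf(p) or p < 0 for p in probs)


# A well-formed distribution passes; one NaN is enough to fail the check.
print(has_invalid_probs([0.2, 0.5, 0.3]))
print(has_invalid_probs([0.2, float("nan"), 0.3]))
```

Running a check like this on the model's outputs outside the webui is one way to tell whether the NaNs come from the library rather than from the front end.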
The half vs float error is something else. I tested this when doing inference from said webui. You built for all architectures; I don't think that's the default.
0cc4m's script tests HW matmul first and that SHOULD fail.. I will give it a try and see what happens.
This script, you mean? https://gist.github.com/0cc4m/a753b6a16a618cfbe747a74920dc50f6
Reading it, it also patches BnB, and it is for a much older version.
I did some testing.
Load the model:
Inference:
Output generated in 17.68 seconds (2.60 tokens/s, 46 tokens, context 71, seed 572183632)
Output generated in 4.18 seconds (2.16 tokens/s, 9 tokens, context 68, seed 1482993057)
Training:
INFO:Loading raw text file dataset...
INFO:Getting model ready...
INFO:Prepping for training...
INFO:Creating LoRA model...
INFO:Starting training...
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 3
wandb: You chose "Don't visualize my results"
wandb: Tracking run with wandb version 0.14.2
wandb: W&B syncing is set to offline in this directory.
wandb: Run wandb online or set WANDB_MODE=online to enable cloud syncing.
{'train_runtime': 734.5184, 'train_samples_per_second': 8.659, 'train_steps_per_second': 0.065, 'train_loss': 2.2761232058207193, 'epoch': 0.18}
INFO:LoRA training run is completed and saved.
INFO:Training interrupted.
It's slightly faster using the adamw_bnb_8bit optimizer. On this gen of card it will never be super great due to the lack of HW matmul... but hey, us and the $3k V100 people are in the same boat. :100:
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
One change is good, the other would be a degradation in speed. Let's discuss how to fix this while maintaining speed for other GPUs.
Haven't really compared this recently with the new codebase. In one of the issues a commenter said that only the first change was required.
@Ph0rk0z I would love to make this PR actionable, but I'm still struggling to understand what Tim means.
According to him one of the changes is good to go and the other "that adds lines which do 16 bit computation by casting the entire matrix to 16 bit that is more inefficient in many cases" needs improvement. Do you know what exactly he means and is this something we could wrap up together?
Maybe we can already merge the change that's good to go and handle the other one separately?
"Force push over-rode but it isn't fixed." What do you mean by that? Is the commit in the PR the only thing to consider or is there something missing?
I think I understand the change in forward() but I'm struggling to understand what I see in backward().
The change in forward is at least limited in surface area to GPUs from Volta and older, so I suspect this is the part that is "good."
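As a rough illustration of that kind of gating, the sketch below decides from the CUDA compute capability whether a card would take a 16-bit fallback path. The function name uses_fp16_fallback and the exact cutoff are assumptions made for illustration, not the condition used in this PR. For reference, the P40 reports capability 6.1, the P100 reports 6.0, and the V100 (Volta) reports 7.0.

```python
# Assumed cutoff for illustration: Volta (7.0) and anything older
# is treated as lacking hardware int8 matmul.
VOLTA = (7, 0)


def uses_fp16_fallback(capability):
    """Return True if a (major, minor) capability is Volta or older.

    Hypothetical helper: tuples compare element by element, so
    (6, 1) <= (7, 0) is True and (7, 5) <= (7, 0) is False.
    """
    return tuple(capability) <= VOLTA


for name, cap in [("P40", (6, 1)), ("V100", (7, 0)), ("T4", (7, 5))]:
    print(name, uses_fp16_fallback(cap))
```

With PyTorch installed, the tuple would typically come from torch.cuda.get_device_capability(), so newer cards never reach the fallback branch.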
Has anyone run the unit tests with these changes?
Can be tried with just the fwd change to see if it still NaNs. I think people were saying it worked. I basically moved to GPTQ/GGUF and this languished a while so haven't been paying attention and re-testing. My bad. Sat so long I didn't think it would be accepted.
I removed backwards pass so people can try it. Haven't had time to test on my machine yet, I'm down to my P6000 and P100 here.