Fix for Pascal NaN redux

Open Ph0rk0z opened this issue 2 years ago • 11 comments

Force push over-rode but it isn't fixed.

I have tried this and it works. Please test it on your other cards.

All credit goes to: richardwth

@richardwth

fixes: https://github.com/TimDettmers/bitsandbytes/issues/165

Ph0rk0z avatar May 17 '23 13:05 Ph0rk0z

Doesn't seem to work on my Tesla P40 card, unless the oobabooga webui has some other underlying issue as well? How can I confirm whether the issue is the webui or bnb? I am getting RuntimeError: expected scalar type Half but found Float while trying to use 8-bit mode.

I built your patch-1 repository by running CUDA_HOME=~/local/cuda-11.8 CUDA_VERSION=118 make cuda11x then sudo python3 setup.py install

guyman624 avatar May 17 '23 17:05 guyman624

Never mind, it's a bnb issue. I found the 8bit_test.py from that issue you linked and get the same RuntimeError: probability tensor contains either inf, nan or element < 0
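For anyone else trying to narrow this down: the error message names three conditions (inf, NaN, or a negative element). Below is a minimal pure-Python sketch of that same check, so you can see which condition trips. It is not code from bitsandbytes or the webui, and `find_invalid_probs` is a hypothetical helper name used only for illustration.

```python
import math


def find_invalid_probs(probs):
    """Return (index, value) pairs for entries that are NaN, inf, or negative.

    Hypothetical helper: mirrors the three conditions named in the
    "probability tensor contains either inf, nan or element < 0" message.
    """
    return [
        (i, p)
        for i, p in enumerate(probs)
        if math.isnan(p) or math.isinf(p) or p < 0
    ]


# Example: a distribution polluted by a NaN, a negative value, and an inf.
bad = find_invalid_probs([0.5, float("nan"), -0.1, float("inf"), 0.5])
print([i for i, _ in bad])
```

With real model output you would flatten the probabilities to a list first (or apply the equivalent tensor checks). If the 8-bit path produces invalid entries while the fp16 path does not, that points at bnb rather than the webui.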

guyman624 avatar May 17 '23 19:05 guyman624

The half vs float error is something else. I tested this when doing inference from said webui. You built for all architectures; I don't think that's the default.

0cc4m's script tests HW matmul first and that SHOULD fail.. I will give it a try and see what happens.

This script, you mean? https://gist.github.com/0cc4m/a753b6a16a618cfbe747a74920dc50f6

Reading it, it also patches BnB, and it is for a much older version.

Ph0rk0z avatar May 18 '23 11:05 Ph0rk0z

I did some testing.

Load the model:

[image: ModelLoad]

Inference:

Output generated in 17.68 seconds (2.60 tokens/s, 46 tokens, context 71, seed 572183632)
Output generated in 4.18 seconds (2.16 tokens/s, 9 tokens, context 68, seed 1482993057)

[image: inference]

Training:

INFO:Loading raw text file dataset...
INFO:Getting model ready...
INFO:Prepping for training...
INFO:Creating LoRA model...
INFO:Starting training...
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 3
wandb: You chose "Don't visualize my results"
wandb: Tracking run with wandb version 0.14.2
wandb: W&B syncing is set to offline in this directory.
wandb: Run wandb online or set WANDB_MODE=online to enable cloud syncing.
{'train_runtime': 734.5184, 'train_samples_per_second': 8.659, 'train_steps_per_second': 0.065, 'train_loss': 2.2761232058207193, 'epoch': 0.18}
INFO:LoRA training run is completed and saved.
INFO:Training interrupted.

[image: training]

It's slightly faster using the adamw_bnb_8bit optimizer. On this gen of card it will never be super great due to the lack of HW matmul... but hey, us and the $3k V100 people are in the same boat. :100:

Ph0rk0z avatar May 18 '23 12:05 Ph0rk0z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

github-actions[bot] avatar Dec 20 '23 15:12 github-actions[bot]

One change is good, the other would be a degradation in speed. Let's discuss how to fix this while maintaining speed for other GPUs.

TimDettmers avatar Jan 01 '24 17:01 TimDettmers

Haven't really compared this recently with the new codebase. In one of the issues a commenter said that only the first change was required.

Ph0rk0z avatar Jan 01 '24 19:01 Ph0rk0z

@Ph0rk0z I would love to make this PR actionable, but I'm still struggling to understand what Tim means.

According to him one of the changes is good to go and the other "that adds lines which do 16 bit computation by casting the entire matrix to 16 bit that is more inefficient in many cases" needs improvement. Do you know what exactly he means and is this something we could wrap up together?

Maybe we can already merge the change that's good to go and handle the other one separately?

"Force push over-rode but it isn't fixed." What do you mean by that? Is the commit in the PR the only thing to consider or is there something missing?

Titus-von-Koeller avatar Feb 29 '24 14:02 Titus-von-Koeller

I think I understand the change in forward() but I'm struggling to understand what I see in backward().

The change in forward is at least limited in surface area to GPUs from Volta and older, so I suspect this is the part that is "good."
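To make the "Volta and older" scope concrete, here is a rough sketch of gating on compute capability. The (7, 5) threshold is my assumption (Turing is the first generation with int8 tensor cores), and `needs_fallback` is a hypothetical name, not a function from this PR. The capability values are the published ones for the cards mentioned in this thread.

```python
# Compute capabilities (major, minor) of cards mentioned in this thread.
KNOWN_CARDS = {
    "Tesla P100": (6, 0),    # Pascal
    "Tesla P40": (6, 1),     # Pascal
    "Quadro P6000": (6, 1),  # Pascal
    "Tesla V100": (7, 0),    # Volta
}

# Assumption: int8 tensor-core matmul needs Turing (7.5) or newer.
INT8_TENSOR_CORE_MIN = (7, 5)


def needs_fallback(capability):
    """Return True if the card lacks hardware int8 matmul (hypothetical gate)."""
    # Tuples compare lexicographically, so (7, 0) < (7, 5) < (8, 0).
    return tuple(capability) < INT8_TENSOR_CORE_MIN


for name, cap in KNOWN_CARDS.items():
    print(name, cap, "fallback" if needs_fallback(cap) else "native int8")
```

On a live system the tuple would come from `torch.cuda.get_device_capability()`. Anything at 7.5 or above never takes the fallback branch under this gate, which is why a forward change written this way should not affect speed on newer GPUs.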

Has anyone run the unit tests with these changes?

matthewdouglas avatar Feb 29 '24 20:02 matthewdouglas

It can be tried with just the forward change to see if it still produces NaNs. I think people were saying it worked. I basically moved to GPTQ/GGUF and this languished for a while, so I haven't been paying attention or re-testing. My bad. It sat so long I didn't think it would be accepted.

Ph0rk0z avatar Mar 01 '24 23:03 Ph0rk0z

I removed the backward pass change so people can try it. Haven't had time to test on my machine yet; I'm down to my P6000 and P100 here.

Ph0rk0z avatar Mar 04 '24 11:03 Ph0rk0z