
Using nerdy rodent's dreamlab training, I get a CUDA error during training.

Open 311-code opened this issue 2 years ago • 7 comments

I am following Nerdy Rodent's dreamlab local install video step by step, and at the end bitsandbytes gives an error. I tried reloading all the CUDA stuff and tried the new CUDA 11.8, which seems to differ from the video, and it still gives the same error:

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
For effortless bug reporting copy-paste your error into this form: https://docs.google.com/forms/d/e/1FAIpQLScPB8emS3Thkp66nvqwmjTEgxp8Y9ufuWTzFyr9kJ5AoI47dQ/viewform?usp=sf_link

/home/user/anaconda3/envs/diffusers/lib/python3.9/site-packages/bitsandbytes/cuda_setup/paths.py:86: UserWarning: /home/user/anaconda3/envs/diffusers did not contain libcudart.so as expected! Searching further paths...
  warn(
/home/user/anaconda3/envs/diffusers/lib/python3.9/site-packages/bitsandbytes/cuda_setup/paths.py:20: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('CompVis/stable-diffusion-v1-4')}
  warn(
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: WARNING! libcuda.so not found! Do you have a CUDA driver installed? If you are on a cluster, make sure you are on a CUDA machine!
Traceback (most recent call last):
  File "/home/user/github/diffusers/examples/dreambooth/train_dreambooth.py", line 657, in <module>
    main()
  File "/home/user/github/diffusers/examples/dreambooth/train_dreambooth.py", line 446, in main
    import bitsandbytes as bnb
  File "/home/user/anaconda3/envs/diffusers/lib/python3.9/site-packages/bitsandbytes/__init__.py", line 6, in <module>
    from .autograd._functions import (
  File "/home/user/anaconda3/envs/diffusers/lib/python3.9/site-packages/bitsandbytes/autograd/_functions.py", line 5, in <module>
    import bitsandbytes.functional as F
  File "/home/user/anaconda3/envs/diffusers/lib/python3.9/site-packages/bitsandbytes/functional.py", line 13, in <module>
    from .cextension import COMPILED_WITH_CUDA, lib
  File "/home/user/anaconda3/envs/diffusers/lib/python3.9/site-packages/bitsandbytes/cextension.py", line 41, in <module>
    lib = CUDALibrary_Singleton.get_instance().lib
  File "/home/user/anaconda3/envs/diffusers/lib/python3.9/site-packages/bitsandbytes/cextension.py", line 37, in get_instance
    cls._instance.initialize()
  File "/home/user/anaconda3/envs/diffusers/lib/python3.9/site-packages/bitsandbytes/cextension.py", line 15, in initialize
    binary_name = evaluate_cuda_setup()
  File "/home/user/anaconda3/envs/diffusers/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py", line 132, in evaluate_cuda_setup
    cc = get_compute_capability(cuda)
  File "/home/user/anaconda3/envs/diffusers/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py", line 105, in get_compute_capability
    ccs = get_compute_capabilities(cuda)
  File "/home/user/anaconda3/envs/diffusers/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py", line 83, in get_compute_capabilities
    check_cuda_result(cuda, cuda.cuDeviceGetCount(ctypes.byref(nGpus)))
AttributeError: 'NoneType' object has no attribute 'cuDeviceGetCount'
Traceback (most recent call last):
  File "/home/user/anaconda3/envs/diffusers/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/user/anaconda3/envs/diffusers/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 43, in main
    args.func(args)
  File "/home/user/anaconda3/envs/diffusers/lib/python3.9/site-packages/accelerate/commands/launch.py", line 837, in launch_command
    simple_launcher(args)
  File "/home/user/anaconda3/envs/diffusers/lib/python3.9/site-packages/accelerate/commands/launch.py", line 354, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/user/anaconda3/envs/diffusers/bin/python', 'train_dreambooth.py', '--pretrained_model_name_or_path=CompVis/stable-diffusion-v1-4', '--use_auth_token', '--instance_data_dir=training', '--output_dir=classes', '--instance_prompt=A sks dog', '--resolution=512', '--center_crop', '--train_batch_size=1', '--mixed_precision=no', '--use_8bit_adam', '--gradient_accumulation_steps=1', '--learning_rate=5e-6', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--sample_batch_size=4', '--max_train_steps=800']' returned non-zero exit status 1.

311-code avatar Oct 04 '22 20:10 311-code
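The final AttributeError in the trace above suggests that the handle for libcuda.so was None by the time cuDeviceGetCount was called, which matches the earlier "libcuda.so not found" warning. The following is a minimal sketch of that failure mode and a guarded alternative. It is not the bitsandbytes implementation: `load_cuda_driver` and `count_gpus` are hypothetical names, and the only assumptions about the driver are the standard CUDA driver API entry points `cuInit` and `cuDeviceGetCount`.

```python
import ctypes


def load_cuda_driver(names=("libcuda.so", "libcuda.so.1")):
    """Try each library name in turn; return a ctypes handle, or None if none loads."""
    for name in names:
        try:
            return ctypes.CDLL(name)
        except OSError:
            # The dynamic loader could not find or open this name; try the next one.
            continue
    return None


def count_gpus(cuda):
    """Return the number of CUDA devices, or None if the driver is unavailable."""
    if cuda is None:
        # Calling cuda.cuDeviceGetCount here would raise the AttributeError from the trace.
        return None
    if cuda.cuInit(0) != 0:
        return None
    n_gpus = ctypes.c_int()
    if cuda.cuDeviceGetCount(ctypes.byref(n_gpus)) != 0:
        return None
    return n_gpus.value


if __name__ == "__main__":
    driver = load_cuda_driver()
    if driver is None:
        print("libcuda.so could not be loaded; check LD_LIBRARY_PATH")
    else:
        print("CUDA devices:", count_gpus(driver))
```

If the script reports that libcuda.so could not be loaded, the problem is likely the library search path rather than the CUDA toolkit version.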

11.8 isn't currently supported; you might try an older CUDA library version. I'd go with 11.6 or earlier.

Thomas-MMJ avatar Oct 05 '22 22:10 Thomas-MMJ

11.8 isn't currently supported; you might try an older CUDA library version. I'd go with 11.6 or earlier.

Screenshot_5

Same error, and I'm on 11.7:

Screenshot_6

  • diffusers version: 0.4.0.dev0
  • Platform: Linux-5.10.102.1-microsoft-standard-WSL2-x86_64-with-glibc2.31
  • Python version: 3.9.13
  • PyTorch version (GPU?): 1.12.1+cu116 (True)
  • Huggingface_hub version: 0.10.0
  • Transformers version: 4.22.2
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

GPU: 1080 ti

How do I downgrade to 11.6? Do I just copy these commands:

Screenshot_8

and it will downgrade, or do I need to uninstall Ubuntu and start all over again?

Or do I need to delete everything CUDA-related with these commands?

Even with those commands, the issue wasn’t solved.
Eventually, the fastest way to fix 2 machines with a package manager was to purge all Nvidia & CUDA packages, which I did with:

sudo apt-get remove --purge '^nvidia-.*'
sudo apt-get remove --purge '^libnvidia-.*'
sudo apt-get remove --purge '^cuda-.*'

ZeroCool22 avatar Oct 07 '22 03:10 ZeroCool22

@brentjohnston

What GPU do you have, and what did you select in accelerate config when it asked [NO/fp16/bf16]?

PS: I tried different selections, but nothing changed.

ZeroCool22 avatar Oct 07 '22 03:10 ZeroCool22

11.8 isn't currently supported; you might try an older CUDA library version. I'd go with 11.6 or earlier.

Can confirm that with CUDA 11.6 it works, at least with a 1080 TI.

Screenshot_9

WSL + Ubuntu DB Working CUDA 11 6!

Nerdy Rodent's guide uses 11.7 on the Pastebin, and in the video he shows 11.8, so neither of them will work; anyone following that part will never get it working.

ZeroCool22 avatar Oct 07 '22 04:10 ZeroCool22

In the video, pastebin and on my system I use CUDA 11.7.1. - typically Nvidia updated the day after ;) You'll need to ensure your MS Windows system is up-to-date as well. If you have old Nvidia drivers in MS Windows you may need to downgrade CUDA.

Where it says CUDA SETUP: WARNING! libcuda.so not found! Do you have a CUDA driver installed? If you are on a cluster, make sure you are on a CUDA machine! you need to reboot / add the line as stated in the video & shown in pastebin file: export LD_LIBRARY_PATH=/usr/lib/wsl/lib:$LD_LIBRARY_PATH
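To check whether that export line has taken effect in the current shell, one option is to look for libcuda.so in the directories listed on LD_LIBRARY_PATH. This is a minimal sketch, not part of bitsandbytes or the video; `find_libcuda` is a hypothetical helper, and it only inspects LD_LIBRARY_PATH, not the system's default library directories.

```python
import os
from pathlib import Path


def find_libcuda(ld_library_path=None):
    """Return every libcuda.so* file found in the colon-separated search path."""
    if ld_library_path is None:
        ld_library_path = os.environ.get("LD_LIBRARY_PATH", "")
    hits = []
    for entry in ld_library_path.split(":"):
        if not entry:
            continue  # skip empty components such as a trailing colon
        directory = Path(entry)
        if directory.is_dir():
            # "libcuda.so*" matches libcuda.so and libcuda.so.1, but not libcudart.so
            hits.extend(sorted(directory.glob("libcuda.so*")))
    return hits


if __name__ == "__main__":
    found = find_libcuda()
    if found:
        for path in found:
            print(f"found {path}")
    else:
        print("libcuda.so is not on LD_LIBRARY_PATH; on WSL2 try adding /usr/lib/wsl/lib")
```

Run it from the same terminal session that launches training, since the variable is only visible to processes started after the export.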

nerdyrodent avatar Oct 07 '22 08:10 nerdyrodent

export LD_LIBRARY_PATH=/usr/lib/wsl/lib:$LD_LIBRARY_PATH

Correct, this was the main cause, not the CUDA version.

The line export LD_LIBRARY_PATH=/usr/lib/wsl/lib:$LD_LIBRARY_PATH needs to be in the config of the train file.

Even if you reboot, it will still not find CUDA if that line is not added.

But in your video you say, "reboot or add this line". So people take that to mean that if you restart, you don't need to add the line, but the line must be added permanently to the config.

ZeroCool22 avatar Oct 07 '22 11:10 ZeroCool22

This is super helpful — thank you, everyone! I will add CUDA 11.8 as soon as possible!

TimDettmers avatar Oct 10 '22 01:10 TimDettmers

CUDA 11.8 was added in the latest release. I also added code that gives some compilation and debugging instructions if the CUDA setup fails.

TimDettmers avatar Oct 27 '22 14:10 TimDettmers

export LD_LIBRARY_PATH=/usr/lib/wsl/lib:$LD_LIBRARY_PATH

Correct, this was the main cause, not the CUDA version.

The line export LD_LIBRARY_PATH=/usr/lib/wsl/lib:$LD_LIBRARY_PATH needs to be in the config of the train file.

Even if you reboot, it will still not find CUDA if that line is not added.

But in your video you say, "reboot or add this line". So people take that to mean that if you restart, you don't need to add the line, but the line must be added permanently to the config.

Sorry to bother, but for us tech newbies, how does one do that?

Spaceisprettybig avatar Feb 17 '23 16:02 Spaceisprettybig

In your train file:

export LD_LIBRARY_PATH=/usr/lib/wsl/lib:$LD_LIBRARY_PATH
export MODEL_NAME="darkstorm2150/Protogen_x3.4_Official_Release"
export INSTANCE_DIR="training"
export OUTPUT_DIR="my_model"

accelerate launch train_dreambooth.py \
  --pretrained_vae_name_or_path="stabilityai/sd-vae-ft-mse" \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --instance_data_dir=$INSTANCE_DIR \
  --output_dir=$OUTPUT_DIR \
  --train_text_encoder \
  --instance_prompt="laarretaa" \
  --resolution=512 \
  --train_batch_size=1 \
  --learning_rate=1e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --gradient_accumulation_steps=2 \
  --gradient_checkpointing \
  --use_8bit_adam \
  --save_interval=500 \
  --max_train_steps=4500
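If training is started from a Python wrapper rather than a shell script, the same effect as the export line can be sketched by building the child process environment explicitly. This is only an illustration under that assumption: `env_with_wsl_lib` is a hypothetical helper, and /usr/lib/wsl/lib is simply the path quoted in this thread. Note that changing LD_LIBRARY_PATH inside an already running Python process generally does not affect how that process loads libraries, which is why the sketch prepares the environment for a child process instead.

```python
import os

WSL_LIB_DIR = "/usr/lib/wsl/lib"  # the path quoted in this thread


def env_with_wsl_lib(base_env=None, lib_dir=WSL_LIB_DIR):
    """Return a copy of the environment with lib_dir first on LD_LIBRARY_PATH."""
    env = dict(os.environ if base_env is None else base_env)
    # Keep the existing entries, dropping empty ones and any earlier copy of lib_dir.
    parts = [p for p in env.get("LD_LIBRARY_PATH", "").split(":") if p and p != lib_dir]
    env["LD_LIBRARY_PATH"] = ":".join([lib_dir] + parts)
    return env
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when launching the `accelerate launch train_dreambooth.py` command with the flags shown above.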

ZeroCool22 avatar Feb 17 '23 16:02 ZeroCool22

I have this issue with nerdy rodents guide on oobabooga's text-generation-webui with one-click installer on gtx 1080ti in windows. Bitsandbytes cannot find cuda. What is the solution there? Can I add that line somewhere?

MikkoHaavisto avatar Mar 14 '23 03:03 MikkoHaavisto

See this post https://github.com/oobabooga/text-generation-webui/issues/20#issuecomment-1411650652 :)

adamsanders avatar Mar 17 '23 03:03 adamsanders

Hi, I got the same error but I don't have the folder "/usr/lib/wsl", could you tell me what the problem might be? Much appreciated!

caizhuoyue77 avatar Feb 21 '24 08:02 caizhuoyue77