
LoRA training fails even though `python -m bitsandbytes` passes all checks

Open · Sturmkater opened this issue 1 year ago · 1 comment

System Info

Distributor ID: Pop
Description:    Pop!_OS 22.04 LTS
Release:        22.04
Codename:       jammy

Running in a Proxmox VM with an RTX 4070 Ti passed through.
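
A minimal sanity check (just a sketch, using the same venv path that appears in the logs below) that the passed-through GPU and CUDA runtime are visible inside the VM:

```
# Confirm the passthrough GPU is visible to the driver and to Torch
nvidia-smi
/home/ai/kohya/venv/bin/python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"
```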

Reproduction

Trying to train a LoRA; it always ends here:

```
Warning: LD_LIBRARY_PATH environment variable is not set.
Certain functionalities may not work correctly.
Please ensure that the required libraries are properly configured.
 
If you use WSL2 you may want to: export LD_LIBRARY_PATH=/usr/lib/wsl/lib/
 
09:41:58-147913 INFO     Version: v22.5.0                                       
                                                                                
09:41:58-150370 INFO     nVidia toolkit detected                                
09:41:59-283380 INFO     Torch 2.0.1+cu118                                      
09:41:59-290575 INFO     Torch backend: nVidia CUDA 11.8 cuDNN 8700             
09:41:59-300734 INFO     Torch detected GPU: NVIDIA GeForce RTX 4070 Ti VRAM    
                         12010 Arch (8, 9) Cores 60                             
09:41:59-301351 INFO     Verifying modules installation status from             
                         /home/ai/kohya/requirements_linux.txt...               
09:41:59-302849 INFO     Verifying modules installation status from             
                         requirements.txt...                                    
09:42:00-810398 INFO     headless: False                                        
09:42:00-812071 INFO     Load CSS...                                            
Running on local URL:  http://10.10.1.5:7861

To create a public link, set share=True in launch().
09:42:25-109421 INFO     Loading config...                                      
09:42:36-427946 INFO     Start training LoRA Standard ...                       
09:42:36-428482 INFO     Checking for duplicate image filenames in training data
                         directory...                                           
09:42:36-440140 INFO     Valid image folder names found in:                     
                         /home/ai/kohya/TRAINING/images                         
09:42:36-441540 INFO     Valid image folder names found in:                     
                         /home/ai/kohya/TRAINING/regularization                 
09:42:36-443066 INFO     Folder 100_catalyst: 617 images found                  
09:42:36-443723 INFO     Folder 100_catalyst: 61700 steps                       
09:42:36-444252 WARNING  Regularisation images are used... Will double the      
                         number of steps required...                            
09:42:36-444714 INFO     Total steps: 61700                                     
09:42:36-445040 INFO     Train batch size: 2                                    
09:42:36-445354 INFO     Gradient accumulation steps: 1                         
09:42:36-445680 INFO     Epoch: 1                                               
09:42:36-445989 INFO     Regulatization factor: 2                               
09:42:36-446302 INFO     max_train_steps (61700 / 2 / 1 * 1 * 2) = 61700        
09:42:36-446686 INFO     stop_text_encoder_training = 0                         
09:42:36-447021 INFO     lr_warmup_steps = 0                                    
09:42:36-447388 INFO     Saving training config to                              
                         /home/ai/kohya/TRAINING/model/catalyst_0.1_20240122-094
                         236.json...                                            
09:42:36-448452 INFO     accelerate launch --num_cpu_threads_per_process=2      
                         "./train_network.py"                                   
                         --pretrained_model_name_or_path="runwayml/stable-diffus
                         ion-v1-5"                                              
                         --train_data_dir="/home/ai/kohya/TRAINING/images"      
                         --reg_data_dir="/home/ai/kohya/TRAINING/regularization"
                         --resolution="768,768"                                 
                         --output_dir="/home/ai/kohya/TRAINING/model"           
                         --logging_dir="/home/ai/kohya/TRAINING/log"
                         --network_alpha="1" --save_model_as=safetensors        
                         --network_module=networks.lora --network_dim=8         
                         --output_name="catalyst_0.1"                           
                         --lr_scheduler_num_cycles="1" --learning_rate="0.0001" 
                         --lr_scheduler="constant" --train_batch_size="2"       
                         --max_train_steps="61700" --save_every_n_epochs="1"    
                         --mixed_precision="bf16" --save_precision="bf16"       
                         --seed="1234" --caption_extension=".txt"               
                         --cache_latents --optimizer_type="AdamW8bit"           
                         --max_grad_norm="1" --max_data_loader_n_workers="1"    
                         --clip_skip=2 --bucket_reso_steps=64 --xformers        
                         --bucket_no_upscale --noise_offset=0.0                 
The following values were not passed to accelerate launch and had defaults used instead:
  --num_processes was set to a value of 1
  --num_machines was set to a value of 1
  --mixed_precision was set to a value of 'no'
  --dynamo_backend was set to a value of 'no'
To avoid this warning pass in values for each of the problematic parameters or run accelerate config.
2024-01-22 09:42:38.744675: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2024-01-22 09:42:38.908373: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-22 09:42:38.908402: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-22 09:42:38.909467: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-01-22 09:42:38.983436: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-01-22 09:42:39.660678: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
prepare tokenizer
Using DreamBooth method.
prepare images.
found directory /home/ai/kohya/TRAINING/images/100_catalyst contains 617 image files
found directory /home/ai/kohya/TRAINING/regularization/1_illustration style contains 1000 image files
No caption file found for 1000 images. Training will continue without captions for these images. If class token exists, it will be used. / 1000枚の画像にキャプションファイルが見つかりませんでした。これらの画像についてはキャプションなしで学習を続行します。class tokenが存在する場合はそれを使います。
/home/ai/kohya/TRAINING/regularization/1_illustration style/00000-172513325-illustration style.png
/home/ai/kohya/TRAINING/regularization/1_illustration style/00001-172513326-illustration style.png
/home/ai/kohya/TRAINING/regularization/1_illustration style/00002-172513327-illustration style.png
/home/ai/kohya/TRAINING/regularization/1_illustration style/00003-172513328-illustration style.png
/home/ai/kohya/TRAINING/regularization/1_illustration style/00004-172513329-illustration style.png
/home/ai/kohya/TRAINING/regularization/1_illustration style/00005-172513330-illustration style.png... and 995 more
61700 train images with repeating.
1000 reg images.
[Dataset 0]
  batch_size: 2
  resolution: (768, 768)
  enable_bucket: False

  [Subset 0 of Dataset 0]
    image_dir: "/home/ai/kohya/TRAINING/images/100_catalyst"
    image_count: 617
    num_repeats: 100
    shuffle_caption: False
    keep_tokens: 0
    keep_tokens_separator: 
    caption_dropout_rate: 0.0
    caption_dropout_every_n_epoches: 0
    caption_tag_dropout_rate: 0.0
    caption_prefix: None
    caption_suffix: None
    color_aug: False
    flip_aug: False
    face_crop_aug_range: None
    random_crop: False
    token_warmup_min: 1,
    token_warmup_step: 0,
    is_reg: False
    class_tokens: catalyst
    caption_extension: .txt

  [Subset 1 of Dataset 0]
    image_dir: "/home/ai/kohya/TRAINING/regularization/1_illustration style"
    image_count: 1000
    num_repeats: 1
    shuffle_caption: False
    keep_tokens: 0
    keep_tokens_separator: 
    caption_dropout_rate: 0.0
    caption_dropout_every_n_epoches: 0
    caption_tag_dropout_rate: 0.0
    caption_prefix: None
    caption_suffix: None
    color_aug: False
    flip_aug: False
    face_crop_aug_range: None
    random_crop: False
    token_warmup_min: 1,
    token_warmup_step: 0,
    is_reg: True
    class_tokens: illustration style
    caption_extension: .txt


[Dataset 0]
loading image sizes.
100%|█████████████████████████████████████| 1617/1617 [00:00<00:00, 3574.95it/s]
prepare dataset
preparing accelerator
loading model for process 0/1
load Diffusers pretrained models: runwayml/stable-diffusion-v1-5
Loading pipeline components...: 100%|█████████████| 5/5 [00:00<00:00, 11.09it/s]
You have disabled the safety checker for <class 'diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline'> by passing safety_checker=None. Ensure that you abide to the conditions of the Stable Diffusion license and do not expose unfiltered results in services or applications open to the public. Both the diffusers team and Hugging Face strongly recommend to keep the safety filter enabled in all public facing circumstances, disabling it only for use-cases that involve analyzing network behavior or auditing its results. For more information, please have a look at https://github.com/huggingface/diffusers/pull/254 .
UNet2DConditionModel: 64, 8, 768, False, False
U-Net converted to original U-Net
Enable xformers for U-Net
import network module: networks.lora
[Dataset 0]
caching latents.
checking cache validity...
100%|██████████████████████████████████| 1617/1617 [00:00<00:00, 3259101.19it/s]
caching latents...
100%|███████████████████████████████████████| 1617/1617 [02:43<00:00,  9.91it/s]
create LoRA network. base dim (rank): 8, alpha: 1.0
neuron dropout: p=None, rank dropout: p=None, module dropout: p=None
create LoRA for Text Encoder:
create LoRA for Text Encoder: 72 modules.
create LoRA for U-Net: 192 modules.
enable LoRA for text encoder
enable LoRA for U-Net
prepare optimizer, data loader etc.
False

===================================BUG REPORT===================================
/home/ai/kohya/venv/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:166: UserWarning: Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

Inspect the output of the command and see if you can locate CUDA libraries. You might need to add them
        to your LD_LIBRARY_PATH. If you suspect a bug, please take the information from python -m bitsandbytes
        and open an issue at: https://github.com/TimDettmers/bitsandbytes/issues
Traceback (most recent call last):
  File "/home/ai/kohya/venv/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/ai/kohya/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/home/ai/kohya/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1017, in launch_command
    simple_launcher(args)
  File "/home/ai/kohya/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 637, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/ai/kohya/venv/bin/python', './train_network.py', '--pretrained_model_name_or_path=runwayml/stable-diffusion-v1-5', '--train_data_dir=/home/ai/kohya/TRAINING/images', '--reg_data_dir=/home/ai/kohya/TRAINING/regularization', '--resolution=768,768', '--output_dir=/home/ai/kohya/TRAINING/model', '--logging_dir=/home/ai/kohya/TRAINING/log', '--network_alpha=1', '--save_model_as=safetensors', '--network_module=networks.lora', '--network_dim=8', '--output_name=catalyst_0.1', '--lr_scheduler_num_cycles=1', '--learning_rate=0.0001', '--lr_scheduler=constant', '--train_batch_size=2', '--max_train_steps=61700', '--save_every_n_epochs=1', '--mixed_precision=bf16', '--save_precision=bf16', '--seed=1234', '--caption_extension=.txt', '--cache_latents', '--optimizer_type=AdamW8bit', '--max_grad_norm=1', '--max_data_loader_n_workers=1', '--clip_skip=2', '--bucket_reso_steps=64', '--xformers', '--bucket_no_upscale', '--noise_offset=0.0']' returned non-zero exit status 1.
```
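
The traceback above only shows accelerate's subprocess wrapper; the underlying failure is the bitsandbytes CUDA setup warning printed before it. A sketch of how to get the library's own report from the same venv (path taken from the traceback):

```
# Run bitsandbytes' self-check with the same interpreter kohya uses
source /home/ai/kohya/venv/bin/activate
python -m bitsandbytes
```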

And the output from the `python -m bitsandbytes` command:

```
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++ BUG REPORT INFORMATION ++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++ /usr/local CUDA PATHS +++++++++++++++++++
/usr/local/cuda-12.3/targets/x86_64-linux/lib/stubs/libcuda.so
/usr/local/cuda-12.3/targets/x86_64-linux/lib/libcudart.so

+++++++++++++++ WORKING DIRECTORY CUDA PATHS +++++++++++++++
/home/ai/kohya/venv/lib/python3.10/site-packages/onnxruntime/capi/libonnxruntime_providers_cuda.so
/home/ai/kohya/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda_linalg.so
/home/ai/kohya/venv/lib/python3.10/site-packages/torch/lib/libc10_cuda.so
/home/ai/kohya/venv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so
/home/ai/kohya/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda114_nocublaslt.so
/home/ai/kohya/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda111.so
/home/ai/kohya/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda111_nocublaslt.so
/home/ai/kohya/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so
/home/ai/kohya/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda120_nocublaslt.so
/home/ai/kohya/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda122.so
/home/ai/kohya/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda121_nocublaslt.so
/home/ai/kohya/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117_nocublaslt.so
/home/ai/kohya/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda122_nocublaslt.so
/home/ai/kohya/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda115_nocublaslt.so
/home/ai/kohya/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda110_nocublaslt.so
/home/ai/kohya/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda118.so
/home/ai/kohya/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda120.so
/home/ai/kohya/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda114.so
/home/ai/kohya/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda118_nocublaslt.so
/home/ai/kohya/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda121.so
/home/ai/kohya/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda110.so
/home/ai/kohya/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda115.so

++++++++++++++++++ LD_LIBRARY CUDA PATHS +++++++++++++++++++

++++++++++++++++++++++++++ OTHER +++++++++++++++++++++++++++
COMPILED_WITH_CUDA = True
COMPUTE_CAPABILITIES_PER_GPU = ['8.9']
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++ DEBUG INFO END ++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Running a quick check that:
    + library is importable
    + CUDA function is callable


WARNING: Please be sure to sanitize sensible info from any such env vars!

SUCCESS!
Installation was successful!
```
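
The LD_LIBRARY CUDA PATHS section above is empty, which matches the LD_LIBRARY_PATH warning at the top of the training log. A sketch of a possible workaround (not a confirmed fix): point LD_LIBRARY_PATH at the CUDA libraries the report did find, then relaunch the GUI from the same shell:

```
# Untested workaround: expose the CUDA runtime libs the bug report located
# under /usr/local before starting kohya from this shell
export LD_LIBRARY_PATH=/usr/local/cuda-12.3/targets/x86_64-linux/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
```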

Expected behavior

Training a LoRA.

Sturmkater · Jan 22 '24

I'm hitting the same error.

2575044704 · Jan 23 '24