RuntimeError: Detected that PyTorch and torchvision were compiled with different CUDA versions. PyTorch has CUDA Version=11.7 and torchvision has CUDA Version=11.6. Please reinstall the torchvision that matches your PyTorch install.

G-force78 opened this issue 2 years ago • 12 comments

This happens when launching training.

It seems to be a widespread error, not specific to this repo. Any ideas how to fix it?
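
For reference, a minimal way to confirm the mismatch, and the usual fix of reinstalling a torchvision wheel built against the same CUDA as your PyTorch (the version pins below are illustrative only; adjust them to your install):

```python
# Minimal check: do torch and torchvision agree on the CUDA build?
import torch
import torchvision

print("torch      :", torch.__version__, "| CUDA:", torch.version.cuda)
print("torchvision:", torchvision.__version__)  # a +cuXXX suffix shows its CUDA build

# If they disagree (e.g. cu117 vs cu116), reinstall torchvision from the
# matching PyTorch wheel index. Illustrative pin only; match your torch build:
#   pip install torchvision==0.14.0+cu116 \
#       --extra-index-url https://download.pytorch.org/whl/cu116
```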

G-force78 avatar Jan 12 '23 13:01 G-force78

Where are you running the script? If you are using the notebook, does the error occur when you launch the training? Or somewhere before?

I only have access to Google Colab, where the CUDA versions seem to match:

Description: Ubuntu 18.04.6 LTS

  • diffusers==0.11.1
  • torchvision @ https://download.pytorch.org/whl/cu116/torchvision-0.14.0%2Bcu116-cp38-cp38-linux_x86_64.whl
  • transformers==4.25.1
  • xformers @ https://github.com/brian6091/xformers-wheels/releases/download/0.0.15.dev0%2B4c06c79/xformers-0.0.15.dev0+4c06c79.d20221205-cp38-cp38-linux_x86_64.whl

Copy-and-paste the text below in your GitHub issue

  • Accelerate version: 0.15.0
  • Platform: Linux-5.10.147+-x86_64-with-glibc2.27
  • Python version: 3.8.16
  • Numpy version: 1.21.6
  • PyTorch version (GPU?): 1.13.0+cu116 (True)

brian6091 avatar Jan 12 '23 23:01 brian6091

That's odd. Yeah, it happens when the actual training cell is launched; maybe I have an outdated notebook, I'll try the recent one. Very nice notebook to use, by the way.

G-force78 avatar Jan 13 '23 10:01 G-force78

Ok, I actually haven't tried the notebook on the main branch for a while. I will test tonight. Thanks for reporting.

brian6091 avatar Jan 13 '23 13:01 brian6091

I think it needs updating and tweaking. I'm getting error after error from the training cell, and nothing seems to be linked back to the previous cells where the parameters are chosen.

G-force78 avatar Jan 13 '23 13:01 G-force78

Are you referring to the notebook on the main branch?

brian6091 avatar Jan 13 '23 15:01 brian6091

Yes https://colab.research.google.com/github/brian6091/Dreambooth/blob/main/FineTuning_colab.ipynb

G-force78 avatar Jan 14 '23 09:01 G-force78

Ok thanks. I'll have a look today.

brian6091 avatar Jan 14 '23 09:01 brian6091

So I've fixed a couple of things and checked that the dependencies are all ok (at least on Google Colab). Please try the Notebook linked below. Two things:

  1. I maintain this version on a different branch (https://github.com/brian6091/Dreambooth/tree/v0.0.2), so keep that version in mind since I will pull in >800 commits to main this weekend.

  2. You need to run all the cells in sequence so that all the parameters are defined in the workspace. Skipping anything (except the tensorboard visualization cell) will cause an error; see the guard-cell sketch below.
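
On point 2, a hypothetical guard cell could fail fast when a parameter cell was skipped, instead of erroring mid-training; the variable names here are placeholders, not the notebook's actual ones:

```python
# Hypothetical guard cell: run just before the training cell.
# REQUIRED_PARAMS lists placeholder names, not the notebook's real variables.
REQUIRED_PARAMS = ["MODEL_NAME", "INSTANCE_DIR", "OUTPUT_DIR", "TRAIN_STEPS"]

missing = [name for name in REQUIRED_PARAMS if name not in globals()]
if missing:
    raise NameError(f"Run the earlier parameter cells first; undefined: {missing}")
```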

Open In Colab

brian6091 avatar Jan 14 '23 16:01 brian6091

Ok thanks, will give it a go

G-force78 avatar Jan 15 '23 09:01 G-force78

For some reason I got an out-of-memory error, although fp16 and 8-bit Adam are enabled, as is gradient checkpointing.

Generating samples:   0% 0/4 [00:15<?, ?it/s]
Traceback (most recent call last):
  File "/content/Dreambooth/train.py", line 1110, in <module>
    main(args)
  File "/content/Dreambooth/train.py", line 1070, in main
    save_weights(global_step)
  File "/content/Dreambooth/train.py", line 977, in save_weights
    images = pipeline(
  File "/usr/local/lib/python3.8/dist-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 546, in __call__
    image = self.decode_latents(latents)
  File "/usr/local/lib/python3.8/dist-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 341, in decode_latents
    image = self.vae.decode(latents).sample
  File "/usr/local/lib/python3.8/dist-packages/diffusers/models/vae.py", line 605, in decode
    decoded = self._decode(z).sample
  File "/usr/local/lib/python3.8/dist-packages/diffusers/models/vae.py", line 577, in _decode
    dec = self.decoder(z)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/diffusers/models/vae.py", line 217, in forward
    sample = up_block(sample)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/diffusers/models/unet_2d_blocks.py", line 1691, in forward
    hidden_states = resnet(hidden_states, temb=None)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/diffusers/models/resnet.py", line 457, in forward
    hidden_states = self.norm1(hidden_states)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/normalization.py", line 273, in forward
    return F.group_norm(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/functional.py", line 2528, in group_norm
    return torch.group_norm(input, num_groups, weight, bias, eps, torch.backends.cudnn.enabled)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 14.76 GiB total capacity; 12.85 GiB already allocated; 397.75 MiB free; 13.05 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Steps:  33% 401/1200 [07:47<15:30,  1.16s/it, Loss/pred=0.0148, lr/text=3.75e-5, lr/unet=1.5e-6]

G-force78 avatar Jan 15 '23 10:01 G-force78

Are train_batch_size and sample_batch_size both equal to 1? Can you post the args.json output here? (It will be in your output_dir.) It OOMed at a weird step, so I'm not sure what's happening.
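
The traceback shows the OOM in the VAE decode during sample generation rather than in the training step itself, so memory-saving tweaks around that call could help. A minimal sketch, assuming `pipeline` is the StableDiffusionPipeline built in save_weights; the helper function is hypothetical, while enable_attention_slicing is a real diffusers API and the max_split_size_mb hint comes straight from the error message:

```python
import os
import torch

# Allocator hint from the error: limit block splitting to cut fragmentation.
# Must be set before the first CUDA allocation, e.g. at the top of train.py.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

def generate_samples_low_mem(pipeline, prompt, n_samples=4):
    """Generate sample images one at a time so VAE-decode peaks stay small."""
    torch.cuda.empty_cache()             # release blocks cached during training
    pipeline.enable_attention_slicing()  # compute attention in slices
    images = []
    with torch.no_grad():
        for _ in range(n_samples):
            images.extend(pipeline(prompt, num_images_per_prompt=1).images)
    return images
```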

brian6091 avatar Jan 15 '23 11:01 brian6091

They were, yes. I had already deleted the runtime by the time I saw this, so I lost my output dir.

G-force78 avatar Jan 16 '23 12:01 G-force78