Dreambooth
RuntimeError: Detected that PyTorch and torchvision were compiled with different CUDA versions. PyTorch has CUDA Version=11.7 and torchvision has CUDA Version=11.6. Please reinstall the torchvision that matches your PyTorch install.
When launching training
Seems to be an error everywhere with this so not specific to this repo. Any ideas how to fix?
Where are you running the script? If you are using the notebook, does the error occur when you launch the training? Or somewhere before?
I only have access to Google Colab, where the CUDA versions seem to match:
- Description: Ubuntu 18.04.6 LTS
- diffusers==0.11.1
- torchvision @ https://download.pytorch.org/whl/cu116/torchvision-0.14.0%2Bcu116-cp38-cp38-linux_x86_64.whl
- transformers==4.25.1
- xformers @ https://github.com/brian6091/xformers-wheels/releases/download/0.0.15.dev0%2B4c06c79/xformers-0.0.15.dev0+4c06c79.d20221205-cp38-cp38-linux_x86_64.whl
Copy-and-paste the text below in your GitHub issue
- Accelerate version: 0.15.0
- Platform: Linux-5.10.147+-x86_64-with-glibc2.27
- Python version: 3.8.16
- Numpy version: 1.21.6
- PyTorch version (GPU?): 1.13.0+cu116 (True)
That's odd. Yeah, it happens when the actual training cell is launched. Maybe I have an outdated notebook; I'll try the recent one. Very nice notebook to use, by the way.
Ok, I actually haven't tried the notebook on the main branch for a while. I will test tonight. Thanks for reporting.
I think it needs updating and tweaking. I'm getting error after error from the training cell; nothing seems to be linked back to the previous cells where the parameters are chosen.
Are you referring to the notebook on the main branch?
Yes https://colab.research.google.com/github/brian6091/Dreambooth/blob/main/FineTuning_colab.ipynb
Ok thanks. I'll have a look today.
So I've fixed a couple of things and checked that the dependencies are all ok (at least on Google Colab). Please try the Notebook linked below. Two things:
- I maintain this version on a different branch (https://github.com/brian6091/Dreambooth/tree/v0.0.2), so keep that version in mind, since I will pull in >800 commits to main this weekend.
- You need to run all the cells in sequence so that all the parameters are defined in the workspace. Skipping anything (except the tensorboard visualization cell) will cause an error.
Ok thanks, will give it a go
For some reason I got an out-of-memory error, although fp16 and 8-bit Adam were enabled, as was gradient checkpointing.
Generating samples: 0% 0/4 [00:15<?, ?it/s]
Traceback (most recent call last):
File "/content/Dreambooth/train.py", line 1110, in <module>
main(args)
File "/content/Dreambooth/train.py", line 1070, in main
save_weights(global_step)
File "/content/Dreambooth/train.py", line 977, in save_weights
images = pipeline(
File "/usr/local/lib/python3.8/dist-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 546, in __call__
image = self.decode_latents(latents)
File "/usr/local/lib/python3.8/dist-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 341, in decode_latents
image = self.vae.decode(latents).sample
File "/usr/local/lib/python3.8/dist-packages/diffusers/models/vae.py", line 605, in decode
decoded = self._decode(z).sample
File "/usr/local/lib/python3.8/dist-packages/diffusers/models/vae.py", line 577, in _decode
dec = self.decoder(z)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/diffusers/models/vae.py", line 217, in forward
sample = up_block(sample)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/diffusers/models/unet_2d_blocks.py", line 1691, in forward
hidden_states = resnet(hidden_states, temb=None)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/diffusers/models/resnet.py", line 457, in forward
hidden_states = self.norm1(hidden_states)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/normalization.py", line 273, in forward
return F.group_norm(
File "/usr/local/lib/python3.8/dist-packages/torch/nn/functional.py", line 2528, in group_norm
return torch.group_norm(input, num_groups, weight, bias, eps, torch.backends.cudnn.enabled)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 14.76 GiB total capacity; 12.85 GiB already allocated; 397.75 MiB free; 13.05 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Steps: 33% 401/1200 [07:47<15:30, 1.16s/it, Loss/pred=0.0148, lr/text=3.75e-5, lr/unet=1.5e-6]
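The error message itself suggests one mitigation: capping the allocator's split size via `PYTORCH_CUDA_ALLOC_CONF` to reduce fragmentation. A sketch; 128 MiB is an arbitrary example value, not a verified recommendation:

```python
import os

# PYTORCH_CUDA_ALLOC_CONF must be set before the first CUDA allocation, so
# place this at the top of the notebook, before importing torch or launching
# train.py. 128 is an illustrative max_split_size_mb; tune it for your GPU.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])  # → max_split_size_mb:128
```

This is only worth trying when, as in the traceback above, reserved memory (13.05 GiB) is well above allocated memory (12.85 GiB), which points at fragmentation rather than a simple capacity shortfall.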
Are train_batch_size and sample_batch_size both equal to 1? Can you post the args.json output here (it will be in your output_dir)? It OOMed at a weird step, so I'm not sure.
They were, yes. I had already deleted the runtime by the time I saw this, so I lost my output_dir.
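Since the actual args.json was lost, the memory-relevant settings discussed in this thread would look roughly like the dict below. These are illustrative names and values assumed to mirror the script's flags, not a reconstruction of the real file:

```python
# Hypothetical low-memory Dreambooth settings (names assumed, not verified
# against train.py): batch sizes of 1, fp16, 8-bit Adam, and gradient
# checkpointing, matching what the reporter says was enabled.
low_mem_args = {
    "train_batch_size": 1,
    "sample_batch_size": 1,
    "gradient_checkpointing": True,
    "mixed_precision": "fp16",
    "use_8bit_adam": True,
}
print(low_mem_args["train_batch_size"])  # → 1
```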