
Getting error when training Lora (torch._dynamo?)

Open boxnum02 opened this issue 2 years ago • 5 comments

I have spent hours on this problem and cannot figure out how to fix it. I tried all the steps listed in this issue: https://github.com/bmaltais/kohya_ss/issues/192 but it still does not work. I suspect it is related to PyTorch. I did install PyTorch, but the traceback still ends with: ModuleNotFoundError: No module named 'torch._dynamo'

Can anybody please help? Thanks in advance.

max_train_steps = 6000
stop_text_encoder_training = 0
lr_warmup_steps = 0
accelerate launch --num_cpu_threads_per_process=2 "train_db.py" --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" --train_data_dir="C:/Users/Jim/Downloads/Wraith/512/Wraith/image" --resolution=512,512 --output_dir="C:/Users/Jim/Downloads/Wraith/512/Wraith" --logging_dir="C:/Users/Jim/Downloads/Wraith/512/Wraith" --save_model_as=safetensors --output_name="Wraith" --max_data_loader_n_workers="1" --learning_rate="0.0001" --lr_scheduler="constant" --train_batch_size="2" --max_train_steps="6000" --save_every_n_epochs="1" --mixed_precision="fp16" --save_precision="fp16" --seed="1234" --caption_extension=".txt" --cache_latents --optimizer_type="AdamW" --max_data_loader_n_workers="1" --clip_skip=2 --bucket_reso_steps=64 --xformers --bucket_no_upscale
prepare tokenizer
prepare train images.
found directory 100_Wraith contains 120 image files
12000 train images with repeating.
loading image sizes.
100%|██████████████████████████████████████████████████████████████████████████████| 120/120 [00:00<00:00, 1512.90it/s]
prepare dataset
prepare accelerator
Using accelerator 0.15.0 or above.
load Diffusers pretrained models
text_encoder\model.safetensors not found
Fetching 19 files: 100%|███████████████████████████████████████████████████████████████████████| 19/19 [00:00<?, ?it/s]
C:\Users\Jim\kohya_ss\venv\lib\site-packages\transformers\models\clip\feature_extraction_clip.py:28: FutureWarning: The class CLIPFeatureExtractor is deprecated and will be removed in version 5 of Transformers. Please use CLIPImageProcessor instead.
  warnings.warn(
You have disabled the safety checker for <class 'diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline'> by passing safety_checker=None. Ensure that you abide to the conditions of the Stable Diffusion license and do not expose unfiltered results in services or applications open to the public. Both the diffusers team and Hugging Face strongly recommend to keep the safety filter enabled in all public facing circumstances, disabling it only for use-cases that involve analyzing network behavior or auditing its results. For more information, please have a look at https://github.com/huggingface/diffusers/pull/254 .
Replace CrossAttention.forward to use xformers
caching latents.
100%|████████████████████████████████████████████████████████████████████████████████| 120/120 [00:08<00:00, 13.51it/s]
prepare optimizer, data loader etc.
use AdamW optimizer | {}
Traceback (most recent call last):
  File "C:\Users\Jim\kohya_ss\train_db.py", line 346, in <module>
    train(args)
  File "C:\Users\Jim\kohya_ss\train_db.py", line 153, in train
    unet, text_encoder, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
  File "C:\Users\Jim\kohya_ss\venv\lib\site-packages\accelerate\accelerator.py", line 876, in prepare
    result = tuple(
  File "C:\Users\Jim\kohya_ss\venv\lib\site-packages\accelerate\accelerator.py", line 877, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "C:\Users\Jim\kohya_ss\venv\lib\site-packages\accelerate\accelerator.py", line 741, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "C:\Users\Jim\kohya_ss\venv\lib\site-packages\accelerate\accelerator.py", line 914, in prepare_model
    import torch._dynamo as dynamo
ModuleNotFoundError: No module named 'torch._dynamo'
Traceback (most recent call last):
  File "C:\Users\Jim\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\Jim\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\Jim\kohya_ss\venv\Scripts\accelerate.exe\__main__.py", line 7, in <module>
  File "C:\Users\Jim\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 45, in main
    args.func(args)
  File "C:\Users\Jim\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 1104, in launch_command
    simple_launcher(args)
  File "C:\Users\Jim\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 567, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['C:\Users\Jim\kohya_ss\venv\Scripts\python.exe', 'train_db.py', '--pretrained_model_name_or_path=runwayml/stable-diffusion-v1-5', '--train_data_dir=C:/Users/Jim/Downloads/Wraith/512/Wraith/image', '--resolution=512,512', '--output_dir=C:/Users/Jim/Downloads/Wraith/512/Wraith', '--logging_dir=C:/Users/Jim/Downloads/Wraith/512/Wraith', '--save_model_as=safetensors', '--output_name=Wraith', '--max_data_loader_n_workers=1', '--learning_rate=0.0001', '--lr_scheduler=constant', '--train_batch_size=2', '--max_train_steps=6000', '--save_every_n_epochs=1', '--mixed_precision=fp16', '--save_precision=fp16', '--seed=1234', '--caption_extension=.txt', '--cache_latents', '--optimizer_type=AdamW', '--max_data_loader_n_workers=1', '--clip_skip=2', '--bucket_reso_steps=64', '--xformers', '--bucket_no_upscale']' returned non-zero exit status 1.

boxnum02 • Feb 24 '23 16:02
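The traceback above fails on `import torch._dynamo` inside accelerate's `prepare_model`. A quick way to confirm whether the installed PyTorch actually ships that submodule is to probe for it from inside the same venv. The following is only a minimal sketch: the helper name `module_available` is made up for illustration, and the remark that `torch._dynamo` arrived with PyTorch 2.0 is my understanding rather than something stated in this thread.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if the module `name` can be located (hypothetical helper)."""
    try:
        # For dotted names, find_spec imports the parent package first.
        return importlib.util.find_spec(name) is not None
    except (ModuleNotFoundError, ValueError):
        # A missing parent package raises ModuleNotFoundError.
        return False


if __name__ == "__main__":
    for name in ("torch", "torch._dynamo"):
        status = "found" if module_available(name) else "MISSING"
        print(f"{name}: {status}")
    if module_available("torch"):
        import torch

        # Assumption: torch._dynamo is expected from PyTorch 2.0 onward.
        print("torch version:", torch.__version__)
```

Run it with the venv's interpreter (the log shows `C:\Users\Jim\kohya_ss\venv\Scripts\python.exe`). If `torch` is found but `torch._dynamo` is missing, the installed PyTorch probably predates that module, and either the dynamo option has to be turned off or PyTorch upgraded.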

Not sure if it will make a difference, but it looks like your folder structure isn't set up the suggested way.
--train_data_dir="C:/Users/Jim/Downloads/Wraith/512/Wraith/image" is fine.
--output_dir="C:/Users/Jim/Downloads/Wraith/512/Wraith" should be C:/Users/Jim/Downloads/Wraith/512/Wraith/model
--logging_dir="C:/Users/Jim/Downloads/Wraith/512/Wraith" should be C:/Users/Jim/Downloads/Wraith/512/Wraith/log

Also try checking/unchecking the 8bitadam box. GL

SaltySkegg • Feb 24 '23 21:02
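The suggested layout (separate `image`, `model` and `log` folders, with the images inside a `<repeats>_<name>` subfolder such as `100_Wraith`) can be sketched as below. This is an illustration only: the function names are hypothetical, and the step arithmetic simply reproduces the numbers printed in the log (120 images, 100 repeats, batch size 2).

```python
import math
import tempfile
from pathlib import Path


def make_layout(root: Path, concept: str, repeats: int) -> dict:
    """Create image/<repeats>_<concept>, model and log folders under root."""
    layout = {
        "train_data_dir": root / "image",
        "output_dir": root / "model",
        "logging_dir": root / "log",
    }
    for path in layout.values():
        path.mkdir(parents=True, exist_ok=True)
    # The training images go in a subfolder named '<repeats>_<concept>'.
    (layout["train_data_dir"] / f"{repeats}_{concept}").mkdir(exist_ok=True)
    return layout


def repeats_from_folder(folder_name: str) -> int:
    """Read the repeat count from a folder name such as '100_Wraith'."""
    prefix, _, _ = folder_name.partition("_")
    return int(prefix)


def batches_per_epoch(num_images: int, repeats: int, batch_size: int) -> int:
    """Images times repeats, divided by the batch size and rounded up."""
    return math.ceil(num_images * repeats / batch_size)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for flag, path in make_layout(Path(tmp), "Wraith", 100).items():
            print(f"--{flag}={path.as_posix()}")
    print(batches_per_epoch(120, repeats_from_folder("100_Wraith"), 2))
```

With the values from the log this gives 6000 batches per epoch, which matches `--max_train_steps="6000"`.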

Be sure that "Do you wish to optimize your script with torch dynamo?" is set to 'no' when configuring accelerate, unless you need it enabled for some reason.

and1011 • Feb 24 '23 22:02
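The question quoted above is asked by the interactive `accelerate config` command, which is run once inside the venv and stores the answers in a config file. To see what was previously answered, the saved file can be inspected. The sketch below rests on two assumptions that are not confirmed anywhere in this thread: that the file lives at `~/.cache/huggingface/accelerate/default_config.yaml`, and that the dynamo answer is stored under a key named `dynamo_backend`. The helper names are hypothetical.

```python
from pathlib import Path
from typing import Optional

# Assumed default location of the file written by `accelerate config`.
ASSUMED_CONFIG = (
    Path.home() / ".cache" / "huggingface" / "accelerate" / "default_config.yaml"
)


def dynamo_setting(config_text: str, key: str = "dynamo_backend") -> Optional[str]:
    """Return the value stored under `key` (assumed name), or None if absent."""
    for line in config_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#") or ":" not in stripped:
            continue
        name, _, value = stripped.partition(":")
        if name.strip() == key:
            # Drop surrounding quotes, e.g. 'NO' becomes NO.
            return value.strip().strip("'\"")
    return None


def dynamo_disabled(config_text: str) -> bool:
    """Treat a missing key or a value of 'no' (any case) as disabled."""
    value = dynamo_setting(config_text)
    return value is None or value.lower() == "no"


if __name__ == "__main__":
    if ASSUMED_CONFIG.exists():
        text = ASSUMED_CONFIG.read_text(encoding="utf-8")
        print("dynamo setting:", dynamo_setting(text))
        print("dynamo disabled:", dynamo_disabled(text))
    else:
        print("No config found at the assumed path; run `accelerate config`.")
```

If the setting turns out to be anything other than 'no', rerunning `accelerate config` and answering 'no' to the dynamo question is the change described in the comment above.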

> Not sure if it will make a difference, but it looks like your folder structure isn't set up the suggested way. [...]
>
> Also try checking/unchecking the 8bitadam box. GL

Thank you for the reply! I fixed the directories and tried both checking and unchecking 8bit Adam. Sadly, the same error comes up.

boxnum02 • Feb 25 '23 02:02

> Be sure that "Do you wish to optimize your script with torch dynamo?" is set as 'no' when configuring accelerate. Unless you need it enabled for some reason.

Thank you for the reply! Where do I find that option? I don't understand what "configuring accelerate" means.

boxnum02 • Feb 25 '23 02:02

> Do you wish to optimize your script with torch dynamo

After searching on the topic, I realized you were talking about configuring the accelerate package. I did that and chose No for optimizing the script with torch dynamo. That gave me some progress, but another problem has come up. I think Torch is reserving all of my VRAM (I have 12 GB) and leaving none for the training. How can I avoid this?

Folder 100_Wraith: 12000 steps
max_train_steps = 6000
stop_text_encoder_training = 0
lr_warmup_steps = 0
accelerate launch --num_cpu_threads_per_process=2 "train_db.py" --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" --train_data_dir="C:/Users/Jim/Downloads/Wraith/512/Wraith/image" --resolution=512,512 --output_dir="C:/Users/Jim/Downloads/Wraith/512/Wraith/model" --logging_dir="C:/Users/Jim/Downloads/Wraith/512/Wraith/log" --save_model_as=safetensors --output_name="Wraith" --max_data_loader_n_workers="1" --learning_rate="0.0001" --lr_scheduler="constant" --train_batch_size="2" --max_train_steps="6000" --save_every_n_epochs="1" --mixed_precision="fp16" --save_precision="fp16" --seed="1234" --caption_extension=".txt" --cache_latents --optimizer_type="AdamW" --max_data_loader_n_workers="1" --clip_skip=2 --bucket_reso_steps=64 --xformers --bucket_no_upscale
prepare tokenizer
prepare train images.
found directory 100_Wraith contains 120 image files
12000 train images with repeating.
loading image sizes.
100%|██████████████████████████████████████████████████████████████████████████████| 120/120 [00:00<00:00, 9999.93it/s]
prepare dataset
prepare accelerator
Using accelerator 0.15.0 or above.
load Diffusers pretrained models
text_encoder\model.safetensors not found
Fetching 19 files: 100%|████████████████████████████████████████████████████████████| 19/19 [00:00<00:00, 19001.38it/s]
C:\Users\Jim\kohya_ss\venv\lib\site-packages\transformers\models\clip\feature_extraction_clip.py:28: FutureWarning: The class CLIPFeatureExtractor is deprecated and will be removed in version 5 of Transformers. Please use CLIPImageProcessor instead.
  warnings.warn(
You have disabled the safety checker for <class 'diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline'> by passing safety_checker=None. Ensure that you abide to the conditions of the Stable Diffusion license and do not expose unfiltered results in services or applications open to the public. Both the diffusers team and Hugging Face strongly recommend to keep the safety filter enabled in all public facing circumstances, disabling it only for use-cases that involve analyzing network behavior or auditing its results. For more information, please have a look at https://github.com/huggingface/diffusers/pull/254 .
Replace CrossAttention.forward to use xformers
caching latents.
100%|████████████████████████████████████████████████████████████████████████████████| 120/120 [00:06<00:00, 18.68it/s]
prepare optimizer, data loader etc.
use AdamW optimizer | {}
running training / 学習開始
  num train images * repeats / 学習画像の数×繰り返し回数: 12000
  num reg images / 正則化画像の数: 0
  num batches per epoch / 1epochのバッチ数: 6000
  num epochs / epoch数: 1
  batch size per device / バッチサイズ: 2
  total train batch size (with parallel & distributed & accumulation) / 総バッチサイズ(並列学習、勾配合計含む): 2
  gradient ccumulation steps / 勾配を合計するステップ数 = 1
  total optimization steps / 学習ステップ数: 6000
steps: 0%| | 0/6000 [00:00<?, ?it/s]epoch 1/1
Traceback (most recent call last):
  File "C:\Users\Jim\kohya_ss\train_db.py", line 346, in <module>
    train(args)
  File "C:\Users\Jim\kohya_ss\train_db.py", line 272, in train
    optimizer.step()
  File "C:\Users\Jim\kohya_ss\venv\lib\site-packages\accelerate\optimizer.py", line 134, in step
    self.scaler.step(self.optimizer, closure)
  File "C:\Users\Jim\kohya_ss\venv\lib\site-packages\torch\cuda\amp\grad_scaler.py", line 338, in step
    retval = self._maybe_opt_step(optimizer, optimizer_state, *args, **kwargs)
  File "C:\Users\Jim\kohya_ss\venv\lib\site-packages\torch\cuda\amp\grad_scaler.py", line 285, in _maybe_opt_step
    retval = optimizer.step(*args, **kwargs)
  File "C:\Users\Jim\kohya_ss\venv\lib\site-packages\torch\optim\lr_scheduler.py", line 65, in wrapper
    return wrapped(*args, **kwargs)
  File "C:\Users\Jim\kohya_ss\venv\lib\site-packages\torch\optim\optimizer.py", line 113, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\Jim\kohya_ss\venv\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "C:\Users\Jim\kohya_ss\venv\lib\site-packages\torch\optim\adamw.py", line 146, in step
    state['exp_avg'] = torch.zeros_like(p, memory_format=torch.preserve_format)
RuntimeError: CUDA out of memory. Tried to allocate 26.00 MiB (GPU 0; 11.99 GiB total capacity; 10.89 GiB already allocated; 0 bytes free; 11.15 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
steps: 0%| | 0/6000 [00:04<?, ?it/s]
Traceback (most recent call last):
  File "C:\Users\Jim\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\Jim\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\Jim\kohya_ss\venv\Scripts\accelerate.exe\__main__.py", line 7, in <module>
  File "C:\Users\Jim\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 45, in main
    args.func(args)
  File "C:\Users\Jim\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 1104, in launch_command
    simple_launcher(args)
  File "C:\Users\Jim\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 567, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['C:\Users\Jim\kohya_ss\venv\Scripts\python.exe', 'train_db.py', '--pretrained_model_name_or_path=runwayml/stable-diffusion-v1-5', '--train_data_dir=C:/Users/Jim/Downloads/Wraith/512/Wraith/image', '--resolution=512,512', '--output_dir=C:/Users/Jim/Downloads/Wraith/512/Wraith/model', '--logging_dir=C:/Users/Jim/Downloads/Wraith/512/Wraith/log', '--save_model_as=safetensors', '--output_name=Wraith', '--max_data_loader_n_workers=1', '--learning_rate=0.0001', '--lr_scheduler=constant', '--train_batch_size=2', '--max_train_steps=6000', '--save_every_n_epochs=1', '--mixed_precision=fp16', '--save_precision=fp16', '--seed=1234', '--caption_extension=.txt', '--cache_latents', '--optimizer_type=AdamW', '--max_data_loader_n_workers=1', '--clip_skip=2', '--bucket_reso_steps=64', '--xformers', '--bucket_no_upscale']' returned non-zero exit status 1.

boxnum02 • Feb 25 '23 03:02
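The out-of-memory message above carries enough numbers to tell whether PyTorch is hoarding cached memory or whether the training itself needs more than the card has. The error text says `max_split_size_mb` is worth trying when reserved memory is much larger than allocated memory, so comparing the two is a reasonable first check. The following sketch parses that message; the regular expression is fitted to the wording shown in this log and may not match other PyTorch versions, and the function names are made up for illustration.

```python
import re

# Conversion factors to MiB for the units that appear in the message.
UNIT_TO_MIB = {"bytes": 1 / 2**20, "KiB": 1 / 1024, "MiB": 1.0, "GiB": 1024.0}

# Pattern fitted to the message format seen in this thread (an assumption).
OOM_PATTERN = re.compile(
    r"Tried to allocate (?P<request>[\d.]+ \w+).*?"
    r"(?P<total>[\d.]+ \w+) total capacity; "
    r"(?P<allocated>[\d.]+ \w+) already allocated; "
    r"(?P<free>[\d.]+ \w+) free; "
    r"(?P<reserved>[\d.]+ \w+) reserved"
)


def to_mib(quantity: str) -> float:
    """Convert a string such as '11.99 GiB' to MiB."""
    number, unit = quantity.split()
    return float(number) * UNIT_TO_MIB[unit]


def parse_oom(message: str) -> dict:
    """Extract the memory figures (in MiB) from a CUDA OOM message."""
    match = OOM_PATTERN.search(message)
    if match is None:
        raise ValueError("message does not look like the expected OOM text")
    return {name: to_mib(value) for name, value in match.groupdict().items()}


def cached_but_unused(stats: dict) -> float:
    """Memory PyTorch has reserved but not handed out to tensors, in MiB."""
    return stats["reserved"] - stats["allocated"]


if __name__ == "__main__":
    message = (
        "CUDA out of memory. Tried to allocate 26.00 MiB (GPU 0; 11.99 GiB "
        "total capacity; 10.89 GiB already allocated; 0 bytes free; 11.15 GiB "
        "reserved in total by PyTorch)"
    )
    stats = parse_oom(message)
    print(f"reserved minus allocated: {cached_but_unused(stats):.2f} MiB")
    print(f"allocated share of card: {stats['allocated'] / stats['total']:.0%}")
```

For the numbers in this log, reserved (11.15 GiB) is only about 0.26 GiB above allocated (10.89 GiB), so nearly all of the 12 GB is held by live tensors rather than by an idle cache. That suggests, though it does not prove, that fragmentation tuning through `PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:<value>` would help little here, and that lowering memory use (for example dropping `--train_batch_size` from 2 to 1, or the 8-bit Adam option mentioned earlier in the thread) is the more promising direction.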

I'm having the same problem. Did you figure out how to fix it?

ctimict • Apr 21 '23 11:04