Training LoRA models on NVIDIA GTX 1660 6GB Fails with "NaN detected in latents" Error
Introduction
I encountered an issue while training LoRA models on my NVIDIA GeForce GTX 1660 6GB card. The training script terminated unexpectedly, reporting a "NaN detected in latents" error. This issue seems to prevent successful model training using this specific GPU setup.
Environment Details
- Operating System: Windows 10
- Stable Diffusion Interface Installation Source: GitHub repository, with a minor modification in requirements.txt to use gradio==3.44.0 (see the snippet after this list)
- GPU: NVIDIA GeForce GTX 1660 6GB
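For reference, the only change to requirements.txt is the pinned Gradio version; the rest of the file is assumed to be unchanged from the repository:

```
gradio==3.44.0
```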
Steps to Reproduce
- Set up the environment as per the instructions in the Kohya_ss-GUI-LoRA-Portable GitHub repository, with the mentioned modification in requirements.txt.
- Attempt to train a LoRA model using the provided training script.
- Observe the process termination with the error: "NaN detected in latents".
Expected Behavior
The training process should run without encountering NaN errors in latents, allowing for successful model training.
Actual Behavior
The training process fails early with a RuntimeError indicating "NaN detected in latents", specifically pointing to a problematic image file. This suggests an issue in handling certain data types or values during the training phase.
```
[Dataset 0]
loading image sizes.
100%|██████████████████████████████████████████████████████████| 30/30 [00:00<00:00, 2991.44it/s]
prepare dataset
preparing accelerator
loading model for process 0/1
load StableDiffusion checkpoint: C:\stable-diffusion-portable-main\stable-diffusion-portable-main\models\Stable-diffusion\Reliberate_v3.safetensors
UNet2DConditionModel: 64, 8, 768, False, False
loading u-net: <All keys matched successfully>
loading vae: <All keys matched successfully>
loading text encoder: <All keys matched successfully>
Enable xformers for U-Net
A matching Triton is not available, some optimizations will not be enabled. Error caught was: No module named 'triton'
import network module: networks.lora
[Dataset 0]
caching latents.
checking cache validity...
100%|████████████████████████████████████████████████████████████████████| 30/30 [00:00<?, ?it/s]
caching latents...
  0%|          | 0/30 [00:02<?, ?it/s]
Traceback (most recent call last):
  File "C:\Kohya_ss-GUI-LoRA-Portable-main\train_network.py", line 1033, in <module>
    trainer.train(args)
  File "C:\Kohya_ss-GUI-LoRA-Portable-main\train_network.py", line 267, in train
    train_dataset_group.cache_latents(vae, args.vae_batch_size, args.cache_latents_to_disk, accelerator.is_main_process)
  File "C:\Kohya_ss-GUI-LoRA-Portable-main\library\train_util.py", line 1927, in cache_latents
    dataset.cache_latents(vae, vae_batch_size, cache_to_disk, is_main_process)
  File "C:\Kohya_ss-GUI-LoRA-Portable-main\library\train_util.py", line 952, in cache_latents
    cache_batch_latents(vae, cache_to_disk, batch, subset.flip_aug, subset.random_crop)
  File "C:\Kohya_ss-GUI-LoRA-Portable-main\library\train_util.py", line 2272, in cache_batch_latents
    raise RuntimeError(f"NaN detected in latents: {info.absolute_path}")
RuntimeError: NaN detected in latents: C:\TrainDataDir\100_Name\aehrthgaerthg.png
Traceback (most recent call last):
  File "C:\Kohya_ss-GUI-LoRA-Portable-main\python\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Kohya_ss-GUI-LoRA-Portable-main\python\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Kohya_ss-GUI-LoRA-Portable-main\venv\Scripts\accelerate.exe\__main__.py", line 7, in <module>
  File "C:\Kohya_ss-GUI-LoRA-Portable-main\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 47, in main
    args.func(args)
  File "C:\Kohya_ss-GUI-LoRA-Portable-main\venv\lib\site-packages\accelerate\commands\launch.py", line 1017, in launch_command
    simple_launcher(args)
  File "C:\Kohya_ss-GUI-LoRA-Portable-main\venv\lib\site-packages\accelerate\commands\launch.py", line 637, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['C:\Kohya_ss-GUI-LoRA-Portable-main\venv\Scripts\python.exe', './train_network.py', '--pretrained_model_name_or_path=C:\stable-diffusion-portable-main\stable-diffusion-portable-main\models\Stable-diffusion\Reliberate_v3.safetensors', '--train_data_dir=C:\TrainDataDir', '--resolution=512,512', '--output_dir=C:\stable-diffusion-portable-main\stable-diffusion-portable-main\models\Lora', '--network_alpha=128', '--save_model_as=safetensors', '--network_module=networks.lora', '--text_encoder_lr=5e-05', '--unet_lr=0.0001', '--network_dim=128', '--output_name=Name', '--lr_scheduler_num_cycles=1', '--learning_rate=8e-05', '--lr_scheduler=constant_with_warmup', '--lr_warmup_steps=150', '--train_batch_size=2', '--max_train_steps=1500', '--save_every_n_epochs=1', '--mixed_precision=fp16', '--save_precision=fp16', '--seed=1', '--cache_latents', '--optimizer_type=AdamW8bit', '--max_grad_norm=1', '--max_data_loader_n_workers=1', '--bucket_reso_steps=64', '--xformers', '--bucket_no_upscale', '--noise_offset=0.0', '--wandb_api_key=False']' returned non-zero exit status 1.
```
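If anyone wants to confirm whether a particular image really yields NaN latents outside of the training script, here is a minimal sketch of that check. It assumes the stock SD 1.5 VAE from diffusers rather than the VAE baked into Reliberate_v3.safetensors, and the image path is simply the one from the traceback:

```python
# Minimal sketch (assumption: diffusers + the standard SD 1.5 VAE, not the
# checkpoint's own VAE) to reproduce kohya's NaN-in-latents check for one image.
import numpy as np
import torch
from PIL import Image
from diffusers import AutoencoderKL

device = "cuda"
vae = AutoencoderKL.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="vae", torch_dtype=torch.float16
).to(device)

# Load and normalize the image to [-1, 1], NCHW, fp16, the range the VAE expects.
img = Image.open(r"C:\TrainDataDir\100_Name\aehrthgaerthg.png").convert("RGB").resize((512, 512))
x = torch.from_numpy(np.array(img)).float() / 127.5 - 1.0
x = x.permute(2, 0, 1).unsqueeze(0).to(device, dtype=torch.float16)

with torch.no_grad():
    latents = vae.encode(x).latent_dist.sample()

print("NaN in latents:", torch.isnan(latents).any().item())
```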
Solution/Workaround
After researching similar issues (https://github.com/kohya-ss/sd-scripts/issues/293) and experimenting with potential fixes, I found that enabling torch.backends.cudnn.benchmark resolved the problem. This setting lets cuDNN benchmark the available convolution algorithms and pick the best one for the current hardware and input shapes, which apparently avoids the code path that produces the NaN values. Here is the specific modification made in train_network.py:
```python
# Added at the top of the train() method in train_network.py:
def train(self, args):
    torch.backends.cudnn.benchmark = True
    # The existing body of the method follows unchanged, e.g.:
    session_id = random.randint(0, 2**32)
```
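For context, torch.backends.cudnn.benchmark = True makes cuDNN time the available convolution algorithms the first time it sees a new input shape and cache the fastest one. Since latents are cached at fixed bucket resolutions here, that autotuning cost should only be paid once per shape, so the switch is essentially free in this setup.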
Suggestion for Permanent Fix
It appears that enabling torch.backends.cudnn.benchmark can prevent the NaN error during training on specific hardware setups, such as the NVIDIA GTX 1660 6GB. It would be beneficial for the training script to automatically detect when this setting is necessary or to document this workaround for users with similar hardware configurations.
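As a rough illustration, such detection could look something like the sketch below. The helper name and the device-name check are hypothetical placeholders, not anything that exists in the script today:

```python
import torch

def maybe_enable_cudnn_benchmark() -> None:
    """Hypothetical helper: enable cuDNN autotuning on GPU families reported to hit NaN latents."""
    if not torch.cuda.is_available():
        return
    device_name = torch.cuda.get_device_name(0)  # e.g. "NVIDIA GeForce GTX 1660"
    # GTX 16xx cards are the family reported in this issue; extend the check as needed.
    if "GTX 16" in device_name:
        torch.backends.cudnn.benchmark = True
        print(f"cudnn.benchmark enabled for {device_name}")
```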
Conclusion
Incorporating the torch.backends.cudnn.benchmark = True setting into the training script resolved the "NaN detected in latents" error on an NVIDIA GTX 1660 6GB card, facilitating successful training of LoRA models. This workaround might be helpful for others experiencing similar issues, and a permanent fix or documentation update could further improve the user experience.
I use the same video card. When I start gui.bat, the following error always appears: "ImportError: cannot import name 'set_documentation_group' from 'gradio_client.documentation'". I've already followed all the steps; this part is the only one I can't get past. Do you know what this is?
Good afternoon. Did you correct the line "gradio" to "gradio==3.44.0" in the root file "requirements.txt"?
Not sure if this is related; I'm on a different GPU but still getting only NaNs: at around step 150-250 the loss shoots up and quickly becomes avr_loss=nan 😕
I ran 3000 steps, the model was output successfully, and everything works.
https://github.com/bmaltais/kohya_ss/discussions/1947#discussioncomment-8819458