Training LoRA models on NVIDIA GTX 1660 6GB Fails with "NaN detected in latents" Error
Introduction
I encountered an issue while training LoRA models on my NVIDIA GeForce GTX 1660 6GB card. The training script terminated unexpectedly, reporting a "NaN detected in latents" error. This issue seems to prevent successful model training using this specific GPU setup.
Environment Details
- Operating System: Windows 10
- Stable Diffusion Interface Installation Source: GitHub repository, with a minor modification in requirements.txt to use gradio==3.44.0 (see the snippet after this list)
- GPU: NVIDIA GeForce GTX 1660 6GB
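For reference, the only change to requirements.txt is the pinned Gradio version; the rest of the file is assumed to be unchanged from the repository:

```
gradio==3.44.0
```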
Steps to Reproduce
- Set up the environment as per the instructions in the Kohya_ss-GUI-LoRA-Portable GitHub repository, with the mentioned modification in requirements.txt.
- Attempt to train a LoRA model using the provided training script.
- Observe the process termination with the error: "NaN detected in latents".
Expected Behavior
The training process should run without encountering NaN errors in latents, allowing for successful model training.
Actual Behavior
The training process fails early with a RuntimeError indicating "NaN detected in latents", specifically pointing to a problematic image file. This suggests an issue in handling certain data types or values during the training phase.
```
[Dataset 0]
loading image sizes.
100%|██████████████████████████████████████████████████████████| 30/30 [00:00<00:00, 2991.44it/s]
prepare dataset
preparing accelerator
loading model for process 0/1
load StableDiffusion checkpoint: C:\stable-diffusion-portable-main\stable-diffusion-portable-main\models\Stable-diffusion\Reliberate_v3.safetensors
UNet2DConditionModel: 64, 8, 768, False, False
loading u-net: <All keys matched successfully>
loading vae: <All keys matched successfully>
loading text encoder: <All keys matched successfully>
Enable xformers for U-Net
A matching Triton is not available, some optimizations will not be enabled. Error caught was: No module named 'triton'
import network module: networks.lora
[Dataset 0]
caching latents.
checking cache validity...
100%|████████████████████████████████████████████████████████████████████| 30/30 [00:00<?, ?it/s]
caching latents...
  0%|          | 0/30 [00:02<?, ?it/s]
Traceback (most recent call last):
  File "C:\Kohya_ss-GUI-LoRA-Portable-main\train_network.py", line 1033, in <module>
    trainer.train(args)
  File "C:\Kohya_ss-GUI-LoRA-Portable-main\train_network.py", line 267, in train
    train_dataset_group.cache_latents(vae, args.vae_batch_size, args.cache_latents_to_disk, accelerator.is_main_process)
  File "C:\Kohya_ss-GUI-LoRA-Portable-main\library\train_util.py", line 1927, in cache_latents
    dataset.cache_latents(vae, vae_batch_size, cache_to_disk, is_main_process)
  File "C:\Kohya_ss-GUI-LoRA-Portable-main\library\train_util.py", line 952, in cache_latents
    cache_batch_latents(vae, cache_to_disk, batch, subset.flip_aug, subset.random_crop)
  File "C:\Kohya_ss-GUI-LoRA-Portable-main\library\train_util.py", line 2272, in cache_batch_latents
    raise RuntimeError(f"NaN detected in latents: {info.absolute_path}")
RuntimeError: NaN detected in latents: C:\TrainDataDir\100_Name\aehrthgaerthg.png
Traceback (most recent call last):
  File "C:\Kohya_ss-GUI-LoRA-Portable-main\python\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Kohya_ss-GUI-LoRA-Portable-main\python\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Kohya_ss-GUI-LoRA-Portable-main\venv\Scripts\accelerate.exe\__main__.py", line 7, in <module>
  File "C:\Kohya_ss-GUI-LoRA-Portable-main\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 47, in main
    args.func(args)
  File "C:\Kohya_ss-GUI-LoRA-Portable-main\venv\lib\site-packages\accelerate\commands\launch.py", line 1017, in launch_command
    simple_launcher(args)
  File "C:\Kohya_ss-GUI-LoRA-Portable-main\venv\lib\site-packages\accelerate\commands\launch.py", line 637, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['C:\Kohya_ss-GUI-LoRA-Portable-main\venv\Scripts\python.exe', './train_network.py', '--pretrained_model_name_or_path=C:\stable-diffusion-portable-main\stable-diffusion-portable-main\models\Stable-diffusion\Reliberate_v3.safetensors', '--train_data_dir=C:\TrainDataDir', '--resolution=512,512', '--output_dir=C:\stable-diffusion-portable-main\stable-diffusion-portable-main\models\Lora', '--network_alpha=128', '--save_model_as=safetensors', '--network_module=networks.lora', '--text_encoder_lr=5e-05', '--unet_lr=0.0001', '--network_dim=128', '--output_name=Name', '--lr_scheduler_num_cycles=1', '--learning_rate=8e-05', '--lr_scheduler=constant_with_warmup', '--lr_warmup_steps=150', '--train_batch_size=2', '--max_train_steps=1500', '--save_every_n_epochs=1', '--mixed_precision=fp16', '--save_precision=fp16', '--seed=1', '--cache_latents', '--optimizer_type=AdamW8bit', '--max_grad_norm=1', '--max_data_loader_n_workers=1', '--bucket_reso_steps=64', '--xformers', '--bucket_no_upscale', '--noise_offset=0.0', '--wandb_api_key=False']' returned non-zero exit status 1.
```
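If anyone wants to confirm whether a particular image really yields NaN latents outside of the training script, here is a minimal sketch of that check. It assumes the stock SD 1.5 VAE from diffusers rather than the VAE baked into Reliberate_v3.safetensors, and the image path is simply the one from the traceback:

```python
# Minimal sketch (assumption: diffusers + the standard SD 1.5 VAE, not the
# checkpoint's own VAE) to reproduce kohya's NaN-in-latents check for one image.
import numpy as np
import torch
from PIL import Image
from diffusers import AutoencoderKL

device = "cuda"
vae = AutoencoderKL.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="vae", torch_dtype=torch.float16
).to(device)

# Load and normalize the image to [-1, 1], NCHW, fp16, the range the VAE expects.
img = Image.open(r"C:\TrainDataDir\100_Name\aehrthgaerthg.png").convert("RGB").resize((512, 512))
x = torch.from_numpy(np.array(img)).float() / 127.5 - 1.0
x = x.permute(2, 0, 1).unsqueeze(0).to(device, dtype=torch.float16)

with torch.no_grad():
    latents = vae.encode(x).latent_dist.sample()

print("NaN in latents:", torch.isnan(latents).any().item())
```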
Solution/Workaround
After researching similar issues (https://github.com/kohya-ss/sd-scripts/issues/293) and experimenting with potential fixes, I found that enabling torch.backends.cudnn.benchmark resolved the problem. This setting lets cuDNN benchmark the available convolution algorithms and pick the best one for the current hardware and input shapes, which apparently avoids the code path that produces the NaN values. Here is the specific modification made in train_network.py:
```python
# Added at the top of the train() method in train_network.py:
def train(self, args):
    torch.backends.cudnn.benchmark = True
    # The existing body of the method follows unchanged, e.g.:
    session_id = random.randint(0, 2**32)
```
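For context, torch.backends.cudnn.benchmark = True makes cuDNN time the available convolution algorithms the first time it sees a new input shape and cache the fastest one. Since latents are cached at fixed bucket resolutions here, that autotuning cost should only be paid once per shape, so the switch is essentially free in this setup.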
Suggestion for Permanent Fix
It appears that enabling torch.backends.cudnn.benchmark can prevent the NaN error during training on specific hardware setups, such as the NVIDIA GTX 1660 6GB. It would be beneficial for the training script to automatically detect when this setting is necessary or to document this workaround for users with similar hardware configurations.
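As a rough illustration, such detection could look something like the sketch below. The helper name and the device-name check are hypothetical placeholders, not anything that exists in the script today:

```python
import torch

def maybe_enable_cudnn_benchmark() -> None:
    """Hypothetical helper: enable cuDNN autotuning on GPU families reported to hit NaN latents."""
    if not torch.cuda.is_available():
        return
    device_name = torch.cuda.get_device_name(0)  # e.g. "NVIDIA GeForce GTX 1660"
    # GTX 16xx cards are the family reported in this issue; extend the check as needed.
    if "GTX 16" in device_name:
        torch.backends.cudnn.benchmark = True
        print(f"cudnn.benchmark enabled for {device_name}")
```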
Conclusion
Incorporating the torch.backends.cudnn.benchmark = True setting into the training script resolved the "NaN detected in latents" error on an NVIDIA GTX 1660 6GB card, facilitating successful training of LoRA models. This workaround might be helpful for others experiencing similar issues, and a permanent fix or documentation update could further improve the user experience.
I use the same video card. When I start gui.bat, the following error always appears: "ImportError: cannot import name 'set_documentation_group' from 'gradio_client.documentation'". I've already followed all the steps; this part is the only one I can't get past. Do you know what this is?
Good afternoon. Did you correct the line "gradio" to "gradio==3.44.0" in the root file "requirements.txt"?
Not sure if this is related; I'm on a different GPU but still getting only NaNs: at around step 150-250 the loss shoots up and quickly becomes avr_loss=nan 😕
I ran 3000 steps, the model was output successfully, and everything works.
https://github.com/bmaltais/kohya_ss/discussions/1947#discussioncomment-8819458