⛔[SD3 branch] NaN when using full FP16 with custom optimizers
Problem: When using fp16 and full_fp16 in combination with custom optimizers from the https://github.com/kozistr/pytorch_optimizer library, NaNs appear on the second training step. This issue also occurs when the VAE is specified directly via --vae.
Solution: use bf16 instead. Previously, explicitly setting --no_half_vae appeared to resolve the issue, but it no longer does: NaNs occur even with --no_half_vae.
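For context on why bf16 helps: fp16 and bf16 use the same number of bits, but fp16 trades exponent range for mantissa precision, so large activation or gradient values overflow to inf at ~65504 and later turn into NaNs, while bf16 keeps the full fp32 exponent range. A quick illustration in plain PyTorch, independent of sd-scripts:

```python
import torch

# fp16 overflows above ~65504; the resulting inf values become NaNs in later
# ops (e.g. inf - inf or 0 * inf). bf16 keeps the fp32 exponent range.
print(torch.finfo(torch.float16).max)    # 65504.0
print(torch.finfo(torch.bfloat16).max)   # ~3.39e38

x = torch.tensor([70000.0])
print(x.to(torch.float16))               # tensor([inf], dtype=torch.float16)
print(x.to(torch.bfloat16))              # tensor([70144.], dtype=torch.bfloat16) -- coarser, but finite
```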
NaNs likely do not occur when using 8-bit AdamW together with fp16/full_fp16 (I ran such a setup last week and will confirm the details later).
There are no command-line errors; training simply results in NaNs. The issue occurs even when the configuration is simplified as much as possible.
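To narrow down where the NaNs first appear (VAE/U-Net forward, backward, or the optimizer step itself), a debugging sketch in plain PyTorch may help; this is not part of sd-scripts, and the module name passed in is just a placeholder:

```python
import torch

# Make backward() raise on the first op that produces NaN/inf gradients
# (slow -- enable for debug runs only).
torch.autograd.set_detect_anomaly(True)

def report_nonfinite_grads(model: torch.nn.Module) -> None:
    """Print every parameter whose gradient contains NaN or inf."""
    for name, param in model.named_parameters():
        if param.grad is not None and not torch.isfinite(param.grad).all():
            print(f"non-finite gradient in: {name}")

# In the training loop, call this right after loss.backward() and before
# optimizer.step(), e.g. report_nonfinite_grads(network).
```

If the gradients are still finite after backward but the weights are NaN after the next step, the problem is inside the optimizer update rather than the model.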
Environment: PyTorch 2.7.0, CUDA 12.8 (also tested with 11.8 and 12.4), xFormers, Triton
Last test config:
accelerate launch --num_cpu_threads_per_process 8 sdxl_train_network.py ^
--pretrained_model_name_or_path="K:/model.safetensors" ^
--train_data_dir="data" ^
--output_dir="output_dir" ^
--output_name="btest" ^
--network_args "algo=locon" "conv_dim=32" "conv_alpha=32" "preset=full" "train_norm=False" "dora_wd=True" ^
--resolution="768" ^
--save_model_as="safetensors" ^
--network_module="lycoris.kohya" ^
--shuffle_caption ^
--max_train_epochs=100 ^
--save_every_n_epochs=1 ^
--save_state_on_train_end ^
--save_precision=fp16 ^
--network_dim=32 ^
--network_alpha=32 ^
--train_batch_size=2 ^
--gradient_accumulation_steps=2 ^
--max_data_loader_n_workers=1 ^
--enable_bucket ^
--bucket_reso_steps=64 ^
--min_bucket_reso=768 ^
--max_bucket_reso=1280 ^
--mixed_precision="fp16" ^
--caption_extension=".txt" ^
--gradient_checkpointing ^
--optimizer_type="pytorch_optimizer.optimizer.amos.Amos" ^
--learning_rate=0.00001 ^
--network_train_unet_only ^
--loss_type="huber" ^
--huber_schedule="snr" ^
--huber_c=0.1 ^
--xformers ^
--full_fp16 ^
--mem_eff_attn ^
--seed=1 ^
--logging_dir="logs" ^
--log_with="tensorboard" ^
--persistent_data_loader_workers
Same problem here; some optimizers and schedulers seem to have issues when computing gradients.
full_fp16 training is not stable, so full_bf16 is better. In addition, even full_bf16 is not recommended if you have enough VRAM for standard mixed precision (i.e. without full_bf16).
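For readers unfamiliar with the distinction: standard mixed precision keeps fp32 master weights and only runs the forward/backward in half precision under autocast (with loss scaling for fp16), while --full_fp16/--full_bf16 also cast the weights themselves, so gradients and the optimizer step live in the reduced dtype. Below is a rough plain-PyTorch sketch of the two setups, not the actual sd-scripts code; the model and data are placeholders and a CUDA GPU is assumed:

```python
import torch

model = torch.nn.Linear(128, 128).cuda()          # placeholder network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
batch = torch.randn(4, 128, device="cuda")
target = torch.randn(4, 128, device="cuda")

# (a) Standard mixed precision: fp32 weights, half-precision compute, loss
# scaling so small fp16 gradients do not underflow to zero.
scaler = torch.amp.GradScaler("cuda")
with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = torch.nn.functional.mse_loss(model(batch), target)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad()

# (b) "Full" half precision: the weights themselves are cast, so the optimizer
# update also runs in the reduced dtype. With float16 this is where custom
# optimizers can overflow or underflow; bfloat16 keeps the fp32 exponent range
# and is far less likely to produce NaNs.
model_bf16 = torch.nn.Linear(128, 128).cuda().to(torch.bfloat16)
optimizer_bf16 = torch.optim.AdamW(model_bf16.parameters(), lr=1e-5)
loss = torch.nn.functional.mse_loss(model_bf16(batch.to(torch.bfloat16)),
                                    target.to(torch.bfloat16))
loss.backward()
optimizer_bf16.step()
optimizer_bf16.zero_grad()
```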
@kohya-ss
full_fp16 training is not stable
Yes, I know that fp16 is unstable; it's just that I didn't encounter NaNs this frequently before, so I wanted to clarify whether this is normal behavior or a bug. What especially puzzled me is that --no_half_vae used to fix this "bug".
so full_bf16 is better.
I agree; I mostly train with Prodigy Schedule Free, which is optimized for bf16. But there's a nuance: SDXL was trained in fp16, and many people believe it's best to fine-tune at the same precision as the base model.
In any case, thanks for the clarification.