⛔[SD3 branch] NaN when using full FP16 with custom optimizers
Problem: When using fp16 and full_fp16 in combination with custom optimizers from the https://github.com/kozistr/pytorch_optimizer library, NaNs appear on the second training step. This issue also occurs when the VAE is specified directly via --vae.
Solution: use bf16 instead. Previously, explicitly setting --no_half_vae appeared to resolve the issue, but it no longer does: NaNs occur even with --no_half_vae.
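For context on why bf16 helps: fp16 and bf16 use the same number of bits, but fp16 trades exponent range for mantissa precision, so large activation or gradient values overflow to inf at ~65504 and later turn into NaNs, while bf16 keeps the full fp32 exponent range. A quick illustration in plain PyTorch, independent of sd-scripts:

```python
import torch

# fp16 overflows above ~65504; the resulting inf values become NaNs in later
# ops (e.g. inf - inf or 0 * inf). bf16 keeps the fp32 exponent range.
print(torch.finfo(torch.float16).max)    # 65504.0
print(torch.finfo(torch.bfloat16).max)   # ~3.39e38

x = torch.tensor([70000.0])
print(x.to(torch.float16))               # tensor([inf], dtype=torch.float16)
print(x.to(torch.bfloat16))              # tensor([70144.], dtype=torch.bfloat16) -- coarser, but finite
```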
NaNs likely do not occur when using 8-bit AdamW together with fp16/full_fp16 (I ran such a setup last week and will confirm the details later).
There are no command-line errors; training simply results in NaNs. The issue occurs even when the configuration is simplified as much as possible.
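To narrow down where the NaNs first appear (VAE/U-Net forward, backward, or the optimizer step itself), a debugging sketch in plain PyTorch may help; this is not part of sd-scripts, and the module name passed in is just a placeholder:

```python
import torch

# Make backward() raise on the first op that produces NaN/inf gradients
# (slow -- enable for debug runs only).
torch.autograd.set_detect_anomaly(True)

def report_nonfinite_grads(model: torch.nn.Module) -> None:
    """Print every parameter whose gradient contains NaN or inf."""
    for name, param in model.named_parameters():
        if param.grad is not None and not torch.isfinite(param.grad).all():
            print(f"non-finite gradient in: {name}")

# In the training loop, call this right after loss.backward() and before
# optimizer.step(), e.g. report_nonfinite_grads(network).
```

If the gradients are still finite after backward but the weights are NaN after the next step, the problem is inside the optimizer update rather than the model.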
Environment: PyTorch 2.7.0, CUDA 12.8 (also tested with 11.8 and 12.4), xFormers, Triton
Last test config:
accelerate launch --num_cpu_threads_per_process 8 sdxl_train_network.py ^
--pretrained_model_name_or_path="K:/model.safetensors" ^
--train_data_dir="data" ^
--output_dir="output_dir" ^
--output_name="btest" ^
--network_args "algo=locon" "conv_dim=32" "conv_alpha=32" "preset=full" "train_norm=False" "dora_wd=True" ^
--resolution="768" ^
--save_model_as="safetensors" ^
--network_module="lycoris.kohya" ^
--shuffle_caption ^
--max_train_epochs=100 ^
--save_every_n_epochs=1 ^
--save_state_on_train_end ^
--save_precision=fp16 ^
--network_dim=32 ^
--network_alpha=32 ^
--train_batch_size=2 ^
--gradient_accumulation_steps=2 ^
--max_data_loader_n_workers=1 ^
--enable_bucket ^
--bucket_reso_steps=64 ^
--min_bucket_reso=768 ^
--max_bucket_reso=1280 ^
--mixed_precision="fp16" ^
--caption_extension=".txt" ^
--gradient_checkpointing ^
--optimizer_type="pytorch_optimizer.optimizer.amos.Amos" ^
--learning_rate=0.00001 ^
--network_train_unet_only ^
--loss_type="huber" ^
--huber_schedule="snr" ^
--huber_c=0.1 ^
--xformers ^
--full_fp16 ^
--mem_eff_attn ^
--seed=1 ^
--logging_dir="logs" ^
--log_with="tensorboard" ^
--persistent_data_loader_workers
Same problem here; some optimizers and schedulers seem to have issues when computing gradients.
full_fp16 training is not stable, so full_bf16 is better. In addition, even full_bf16 is not recommended if you have enough VRAM for standard mixed precision (i.e. without full_bf16).
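For readers unfamiliar with the distinction: standard mixed precision keeps fp32 master weights and only runs the forward/backward in half precision under autocast (with loss scaling for fp16), while --full_fp16/--full_bf16 also cast the weights themselves, so gradients and the optimizer step live in the reduced dtype. Below is a rough plain-PyTorch sketch of the two setups, not the actual sd-scripts code; the model and data are placeholders and a CUDA GPU is assumed:

```python
import torch

model = torch.nn.Linear(128, 128).cuda()          # placeholder network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
batch = torch.randn(4, 128, device="cuda")
target = torch.randn(4, 128, device="cuda")

# (a) Standard mixed precision: fp32 weights, half-precision compute, loss
# scaling so small fp16 gradients do not underflow to zero.
scaler = torch.amp.GradScaler("cuda")
with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = torch.nn.functional.mse_loss(model(batch), target)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad()

# (b) "Full" half precision: the weights themselves are cast, so the optimizer
# update also runs in the reduced dtype. With float16 this is where custom
# optimizers can overflow or underflow; bfloat16 keeps the fp32 exponent range
# and is far less likely to produce NaNs.
model_bf16 = torch.nn.Linear(128, 128).cuda().to(torch.bfloat16)
optimizer_bf16 = torch.optim.AdamW(model_bf16.parameters(), lr=1e-5)
loss = torch.nn.functional.mse_loss(model_bf16(batch.to(torch.bfloat16)),
                                    target.to(torch.bfloat16))
loss.backward()
optimizer_bf16.step()
optimizer_bf16.zero_grad()
```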
@kohya-ss
full_fp16 training is not stable
Yes, I know that fp16 is unstable; it's just that I didn't encounter NaNs this frequently before, so I wanted to clarify whether this is normal behavior or a bug. What especially puzzled me is that --no_half_vae used to fix this "bug".
so full_bf16 is better.
I agree; I mostly train with Prodigy Schedule Free, which is optimized for bf16. But there's a nuance: SDXL was trained in fp16, and many people believe it's best to fine-tune at the same precision as the base model.
In any case, thanks for the clarification.