Problems with training on Hugging Face datasets.
export HF_HOME="/data/kmei1/huggingface/"
export DISK_DIR="/data/kmei1/huggingface/cache"
export MODEL_DIR="stabilityai/stable-diffusion-2-1"
export OUTPUT_DIR="canny_model"
export DATASET_NAME="jax-diffusers-event/canny_diffusiondb"
export NCCL_P2P_DISABLE=1
export CUDA_VISIBLE_DEVICES=5
python3 train_codi_flax.py \
  --pretrained_model_name_or_path $MODEL_DIR \
  --output_dir $OUTPUT_DIR \
  --dataset_name $DATASET_NAME \
  --load_from_disk \
  --cache_dir $DISK_DIR \
  --resolution 512 \
  --learning_rate 8e-6 \
  --train_batch_size 2 \
  --gradient_accumulation_steps 2 \
  --revision main \
  --from_pt \
  --mixed_precision bf16 \
  --max_train_steps 200_000 \
  --checkpointing_steps 10_000 \
  --validation_steps 100 \
  --dataloader_num_workers 8 \
  --distill_learning_steps 20 \
  --ema_decay 0.99995 \
  --onestepode uncontrol \
  --onestepode_control_params target \
  --onestepode_sample_eps vprediction \
  --cfg_aware_distill \
  --distill_loss consistency_x \
  --distill_type conditional \
  --image_column original_image \
  --caption_column prompt \
  --conditioning_image transformed_image \
  --report_to wandb \
  --validation_image "figs/control_bird_canny.png" \
  --validation_prompt "birds"
Hello! When I run the training command above (with HF_HOME and DISK_DIR changed to my own paths), the loss becomes NaN. Could you please help me understand why?
Could you please share a visualization of your loss curve? Training should be stable, and NaN losses are rare. @00757039
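When sharing the curve, it may also help to report the step at which the loss first stops being finite. Below is a minimal sketch, assuming the logged loss values have been exported to a plain Python list (for example from the wandb run). The helper `first_nonfinite_step` and the sample values are hypothetical and are not part of `train_codi_flax.py`.

```python
import math


def first_nonfinite_step(losses):
    """Return the index of the first NaN/inf loss, or None if all are finite.

    Hypothetical helper for illustration; not part of the training script.
    """
    for step, value in enumerate(losses):
        if not math.isfinite(value):
            return step
    return None


if __name__ == "__main__":
    # Made-up loss history for illustration only; not real training output.
    example_losses = [0.12, 0.11, 0.10, float("nan"), float("nan")]
    print(first_nonfinite_step(example_losses))
```

Knowing whether the loss is NaN from the very first step or only diverges later can narrow down where to look.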