Problems with training on Hugging Face datasets.

Open jamichss opened this issue 1 year ago • 1 comments

export HF_HOME="/data/kmei1/huggingface/"
export DISK_DIR="/data/kmei1/huggingface/cache"
export MODEL_DIR="stabilityai/stable-diffusion-2-1"
export OUTPUT_DIR="canny_model"
export DATASET_NAME="jax-diffusers-event/canny_diffusiondb"
export NCCL_P2P_DISABLE=1
export CUDA_VISIBLE_DEVICES=5

python3 train_codi_flax.py \
  --pretrained_model_name_or_path $MODEL_DIR \
  --output_dir $OUTPUT_DIR \
  --dataset_name $DATASET_NAME \
  --load_from_disk \
  --cache_dir $DISK_DIR \
  --resolution 512 \
  --learning_rate 8e-6 \
  --train_batch_size 2 \
  --gradient_accumulation_steps 2 \
  --revision main \
  --from_pt \
  --mixed_precision bf16 \
  --max_train_steps 200_000 \
  --checkpointing_steps 10_000 \
  --validation_steps 100 \
  --dataloader_num_workers 8 \
  --distill_learning_steps 20 \
  --ema_decay 0.99995 \
  --onestepode uncontrol \
  --onestepode_control_params target \
  --onestepode_sample_eps vprediction \
  --cfg_aware_distill \
  --distill_loss consistency_x \
  --distill_type conditional \
  --image_column original_image \
  --caption_column prompt \
  --conditioning_image transformed_image \
  --report_to wandb \
  --validation_image "figs/control_bird_canny.png" \
  --validation_prompt "birds"

Hello! When I run the training command above (with HF_HOME and DISK_DIR changed to my own paths), the loss becomes NaN. Could you please help me understand why?

Jul 15 '24 09:07 jamichss

Could you please share a plot of your loss curve? Training should be stable, and NaN losses are rare. @00757039

Jul 15 '24 16:07 MKFMIKU
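
When sharing a loss curve for a NaN report like this one, it can also help to state the step at which the loss first stops being finite. Below is a minimal sketch, assuming the per-step loss values have already been exported (for example from wandb) into a plain Python list. The function name and the example values are hypothetical and are not part of train_codi_flax.py.

```python
import math


def first_non_finite_step(losses):
    """Return the index of the first NaN/inf loss, or None if all are finite."""
    for step, loss in enumerate(losses):
        if not math.isfinite(loss):
            return step
    return None


# Hypothetical loss values for illustration only (not taken from the actual run).
example_losses = [0.91, 0.74, 0.68, float("nan"), float("nan")]
print(first_non_finite_step(example_losses))
```

Knowing whether the first non-finite value appears at the very start or only after many steps may help narrow down whether the cause lies in the setup or in the optimization itself.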