kohya_ss

Won't go over one epoch

Open DiggyDre opened this issue 2 months ago • 7 comments

I am trying to train an SDXL LoRA, and for some reason it will not train for more than one epoch. I need it to run for ten. It recognizes that I entered 10 epochs, but when training starts, it says it is only running for one epoch.

DiggyDre avatar Apr 24 '24 14:04 DiggyDre

What is the train command?

TeKett avatar Apr 24 '24 18:04 TeKett

Can you share the toml file?

bmaltais avatar Apr 25 '24 15:04 bmaltais

I have the same problem and had not seen this post. In my case I'm sharing the toml file; as you can see, epoch is set to 10.

bucket_no_upscale = true
bucket_reso_steps = 64
cache_latents = true
cache_latents_to_disk = true
caption_extension = ".txt"
clip_skip = 1
dynamo_backend = "no"
enable_bucket = true
epoch = 10
gradient_accumulation_steps = 1
gradient_checkpointing = true
huber_c = 0.1
huber_schedule = "snr"
learning_rate = 0.0003
logging_dir = "C:/EnriqueModels/kohya_ss/satelite_images/data_training/log"
loss_type = "l2"
lr_scheduler = "constant"
lr_scheduler_args = []
lr_scheduler_num_cycles = 1
lr_scheduler_power = 1
max_bucket_reso = 2048
max_data_loader_n_workers = 0
max_grad_norm = 1
max_timestep = 1000
max_token_length = 75
max_train_steps = 1600
min_bucket_reso = 256
mixed_precision = "fp16"
multires_noise_discount = 0.3
network_alpha = 1
network_args = []
network_dim = 8
network_module = "networks.lora"
noise_offset_type = "Original"
optimizer_args = [ "scale_parameter=False", "relative_step=False", "warmup_init=False",]
optimizer_type = "Adafactor"
output_dir = "C:/EnriqueModels/kohya_ss/satelite_images/data_training/model"
output_name = "last"
pretrained_model_name_or_path = "runwayml/stable-diffusion-v1-5"
prior_loss_weight = 1
resolution = "512,512"
sample_prompts = "C:/EnriqueModels/kohya_ss/satelite_images/data_training/model\\prompt.txt"
sample_sampler = "euler_a"
save_every_n_epochs = 1
save_model_as = "safetensors"
save_precision = "fp16"
text_encoder_lr = 0.0003
train_batch_size = 1
train_data_dir = "C:/EnriqueModels/kohya_ss/satelite_images/data_training/img"
unet_lr = 0.0003
xformers = true

eapolo avatar May 01 '24 20:05 eapolo

Try changing max_train_steps from 1600 to 0

bmaltais avatar May 01 '24 21:05 bmaltais

Hmm, I checked the code, and if it is set to 0 it will be set to 1600 by sd-scripts… Try setting it to 16000000 so your training never reaches it. I will see if there is a way to make sd-scripts disregard max train steps by forcing it to 0 and passing that to sd-scripts.

bmaltais avatar May 01 '24 21:05 bmaltais

It's weird, because it looks like the epochs are calculated automatically: when I set it to 2 with the same number of images, the model trained for 7 epochs.

eapolo avatar May 01 '24 21:05 eapolo
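An epoch count that seems to be "calculated automatically" is what you would expect if the trainer derives epochs from a step budget. Below is a minimal sketch of that arithmetic, assuming an epoch limit (when given) overrides the step limit and that the epoch count is otherwise the step limit divided by steps per epoch, rounded up. The function name, its parameters, and the 250-batch figure are illustrative only. This is not sd-scripts' actual code, and no dataset size was reported in this thread.

```python
import math

def effective_epochs(batches_per_epoch, max_train_steps,
                     max_train_epochs=None, gradient_accumulation_steps=1):
    """Return (epochs, total_steps) under the assumed scheme."""
    # Optimizer steps needed for one pass over the data.
    steps_per_epoch = math.ceil(batches_per_epoch / gradient_accumulation_steps)
    # Assumption: an explicit epoch limit replaces the step limit.
    if max_train_epochs is not None:
        max_train_steps = max_train_epochs * steps_per_epoch
    # Otherwise the epoch count falls out of the step budget.
    epochs = math.ceil(max_train_steps / steps_per_epoch)
    return epochs, max_train_steps

# Hypothetical dataset sizes, for illustration only:
print(effective_epochs(1600, 1600))                      # (1, 1600)
print(effective_epochs(250, 1600))                       # (7, 1600)
print(effective_epochs(250, 1600, max_train_epochs=10))  # (10, 2500)
```

Under these assumptions, a step limit of 1600 alone yields one epoch when a single pass already takes 1600 steps, whatever epoch value was entered elsewhere.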

There might be some conflict, since you specify both epochs and number of steps. They do the same thing.

Edit: When I checked, --max_train_epochs overrides --max_train_steps, so it should be the opposite of your problem, though.

Try printing the command, since that's what accelerate is going to run.

accelerate launch --num_cpu_threads_per_process=2 "C:\Train\kohya/sd-scripts/train_db.py" --lr_scheduler_type "torch.optim.lr_scheduler.MultiStepLR" --bucket_reso_steps=64 --cache_latents --cache_latents_to_disk --caption_extension=".txt" --enable_bucket --min_bucket_reso=480 --max_bucket_reso=1560 --gradient_accumulation_steps=1 --learning_rate="1e-07" --learning_rate_te="1e-07" --lr_scheduler="constant" --lr_scheduler_args "milestones=[5000]" "gamma=0.2" --max_data_loader_n_workers="1" --resolution="512,768" --min_timestep=1 --max_timestep=1000 --max_token_length=225 --mixed_precision="bf16" --optimizer_type="lion8bit" --output_dir="C:\Train\train model" --output_name="test1" --pretrained_model_name_or_path="C:\Train\Model" --sample_every_n_steps="100" --save_every_n_steps="10000" --save_model_as=diffusers --save_precision="bf16" --shuffle_caption --train_batch_size="1" --train_data_dir="C:\Train\train model\img" --xformers --sample_sampler=euler_a --sample_prompts="C:\Train\train model\sample\prompt.txt" --noise_offset=0.1 --adaptive_noise_scale=0.01 --caption_tag_dropout_rate="0.5" --min_snr_gamma="5" --debiased_estimation_loss --max_train_steps="5000"

TeKett avatar May 02 '24 08:05 TeKett
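Once the command is printed, a quick way to see which limit actually reaches the training script is to pull the step and epoch flags out of it. This is a rough sketch: the helper name is made up for illustration, the epoch flag spelling `--max_train_epochs` is an assumption, and the regular expression only handles the `--flag=value`, `--flag="value"` and `--flag value` forms.

```python
import re

def training_limits(command: str) -> dict:
    """Extract step/epoch limits from a printed training command (illustrative helper)."""
    limits = {}
    for flag in ("max_train_steps", "max_train_epochs"):  # assumed flag spellings
        # Accept --flag=123, --flag="123" and --flag 123.
        match = re.search(rf'--{flag}(?:=|\s+)"?(\d+)"?', command)
        limits[flag] = int(match.group(1)) if match else None
    return limits

cmd = 'accelerate launch train_db.py --train_batch_size="1" --max_train_steps="5000"'
print(training_limits(cmd))
# {'max_train_steps': 5000, 'max_train_epochs': None}
```

If the epoch entry comes back as `None`, the epoch value entered in the GUI never made it onto the command line, and only the step limit is in effect.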